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Hierarchical  models  are  popular  tools  for  the  modeling  of  clustered  data. 

A practical  limitation  to  the  use  of  these  models  is  that  the  likelihood  function 
typically  involves  an  intractable,  high  dimensional  integral.  Several  authors 
propose  deterministic  methods  to  approximate  the  intractable  likelihood  function. 
However,  it  has  been  shown  that  these  methods  do  not  always  work  well.  As 
an  alternative,  stochastic  estimation  methods  have  been  proposed.  Stochastic 
estimation  methods  use  simulation  to  approximate  an  intractable  integral.  Different 
stochastic  estimation  methods  exist.  In  this  dissertation  we  focus  on  five  different 
methods:  simulated  maximum  likelihood  (SML),  Monte  Carlo  EM  (MCEM)  and 
Monte  Carlo  Newton- Raphson  (MCNR),  as  well  as  stochastic  approximation  EM 
(SAEM)  and  stochastic  approximation  Newton-Raphson  (SANR). 

First  we  describe  the  five  individual  methods  and  point  out  limitations  and 
practical  difficulties  associated  with  each.  In  particular,  we  address  the  issue  of 

viii 


convergence,  convergence  rates,  and  stopping  rules,  as  well  as  the  choice  of  the 
starting  values  for  each  of  these  methods.  Next,  we  perform  analytical  as  well  as 
empirical  investigations  to  compare  the  efficiency  of  these  methods.  Specifically, 
we  derive  the  asymptotic  standard  errors  of  MCEM  and  SML  which  show  that  in 
most  practical  applications  of  hierarchical  models  SML  is  very  inefficient  relative  to 
MCEM.  A simulation  study  and  several  examples  support  these  analytical  results. 
We  also  consider  Quasi-Monte  Carlo  techniques  in  an  attempt  to  increase  the 
efficiency  of  SML.  Then  we  characterize  the  convergence  rate  of  MCEM  and  SAEM 
and  show  that  in  hierarchical  models  MCEM  typically  converges  at  a much  faster 
rate.  Our  simulations  support  that  this  fast  convergence  rate  can  result  in  a much 
more  efficient  use  of  the  total  simulation  amount  for  MCEM. 
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CHAPTER  1 
INTRODUCTION 

1.1  Statistical  Analysis  of  Clustered  Data 
In  many  areas  of  research  data  arise  in  clusters  or  groups.  For  example  in 
surveys  or  observational  studies  populations  often  have  a natural  hierarchical 
structure  which  leads  to  the  collection  of  clustered  data.  Langford  and  Bentham 
(1997)  study  sudden  infant  death  in  England  and  Wales  at  district,  county,  and 
regional  levels.  Clearly,  data  collected  from  one  district  form  a more  homogeneous 
group  than  data  from  the  remaining  districts.  This  is  also  true,  at  the  next  level, 
for  data  from  the  same  county.  Similarly,  Ramamurti  (2000)  uses  data  collected 
at  firm-,  industry-  and  country-levels  to  explain  why  emerging  economies  are 
privatizing  state-owned  enterprises,  whereas  Raudenbush  et  al.  (1995)  investigate 
psychological  changes  within  married-couples  based  on  data  collected  at  the 
personal  as  well  as  the  environmental  level.  Clustered  data  can  also  arise  from 
taking  measurements  repeatedly  on  the  same  subject.  In  longitudinal  studies, 
for  example,  data  are  collected  on  the  same  experimental  unit  over  time  or  under 
multiple  conditions  leading  to  clusters  of  data  from  the  same  subject. 

Statistical  inference  from  data  that  occur  in  clusters  is  complicated  by  the  fact 
that  observations  within  each  cluster  are  typically  correlated.  Simple  statistical 
models  assume  independence  between  the  observations  and  are  therefore  not 
appropriate  for  these  types  of  data.  For  example,  ANOVA  or  regression  models 
require  independent  measurements  and  can  lead  to  wrong  conclusions  if  the 
independence  assumption  is  violated. 
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Statistical  models  for  correlated  data,  and  software  packages  for  implementing 
these  models,  are  readily  available  when  the  data  consist  of  clustered  normal 
responses.  Normal  data  typically  occur  when  measurements  are  taken  on  a 
continuous  scale  (after  applying  a suitable  transformation).  However,  when  the 
data  is  non-normal,  modeling,  in  particular  model  fitting,  is  complicated  by 
computational  difficulties.  In  the  following  we  motivate  the  ideas  behind  models  for 
normal  and  non-normal  clustered  data. 

1.2  Models  for  Normal  Clustered  Data 
Normal  data  that  occur  in  clusters  is  often  modeled  using  the  linear  mixed 
model  (LMM).  The  LMM,  which  has  been  investigated  extensively  in  Searle 
et  al.  (1992),  extends  the  classical  linear  model  by  allowing  the  responses  to  be 
correlated.  In  particular,  it  assumes  that  the  correlation  among  observations  from 
the  same  cluster  arises  because  there  is  a natural  heterogeneity  across  clusters 
and  that  this  heterogeneity  can  be  represented  as  a probability  distribution.  To 
account  for  this  heterogeneity,  an  unobserved  random  variable  (a  “random  effect” ) 
from  the  probability  distribution  is  incorporated  additively  into  the  model.  Since 
observations  from  the  same  cluster  share  the  same  random  effect,  correlation  is 
induced  between  the  observations  within  the  cluster.  Estimates  of  the  parameters 
are  obtained  by  maximizing  the  marginal  likelihood,  which  requires  integrating 
the  joint  likelihood  over  the  random  effects.  When  the  random  effects  distribution 
is  assumed  to  be  normal,  the  marginal  likelihood  for  the  LMM  can  be  written  in 
closed  form,  making  estimation  straightforward.  Since  software  packages  for  fitting 
LMMs  are  readily  available,  these  models  are  popular  tools  for  clustered  normal 
data  in  practice. 
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Table  1.1:  Heights  Data 


Family  Number,  Gender  and  Height  in  Inches 


1 

F 

67 

1 

F 

66 

1 

F 

64 

1 

M 

71 

1 

M 

72 

2 

F 

63 

2 

F 

63 

2 

F 

67 

2 

M 

69 

2 

M 

68 

2 

M 

70 

3 

F 

63 

3 

M 

64 

4 

F 

67 

4 

F 

66 

4 

M 

67 

4 

M 

67 

4 

M 

69 

Example.  Table  1.1  is  taken  from  the  online  documentation  of  the  SAS  1 * 
Institute  (2000)  and  shows  data  on  the  heights  (y^)  of  18  individuals  classified 
according  to  family  i (i  — 1, 2, 3 or  4)  and  gender  j (j  = 1 if  gender=M  or  j — 2 
if  gender=F).  A traditional  two-way  analysis  of  variance  for  this  data  results  in  a 
model  of  the  form 

Vijk  — Po  + $ + Pf  + Pjj 3 + Ujki  (1-1) 

where  Pf  and  fij  denote  the  main  effects  from  family  i and  gender  j and  Pf° 
the  family  x gender  interaction  effect.  The  error  terms,  e^,  are  assumed  to  be 
independent  of  each  other  and  normally  distributed  with  mean  zero  and  variance 

ho- 
using SAS’s  procedure  PROC  GLM  to  fit  model  (1.1)  to  the  data,  we  obtain, 

for  example,  for  the  estimate  of  the  difference  of  the  main  gender  effects 

& - & = 1.1667  (StdErr=1.3229),  (1.2) 

resulting  in  a P- Value  of  0.3985. 

Two  of  the  important  assumptions  behind  model  (1.1)  are  that  the  data  are 
normally  distributed  and  independent.  For  the  data  in  Table  1.1,  the  normality 
assumption  is  probably  realistic  since  the  data  are  observed  heights.  However, 


1 The  official  SAS  homepage  on  the  web  (www.sas.com)  explains  that  “Back  in 

1976,  when  SAS  was  founded,  SAS  was  an  acronym  for  statistical  analysis  system. 

But  a lot  has  changed  since  1976.  [. . .]  That  is  why  today’s  SAS  is  no  longer  an 
acronym.  [. . . ] Rather,  it  is  a brand  name.” 
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the  assumption  of  independence  is  questionable.  Since  the  data  occur  in  clusters 
(families),  it  is  very  likely  that  observations  from  the  same  family  are  correlated, 
that  is,  not  independent. 

One  way  of  introducing  a correlation  structure  into  model  (1.1)  is  through  the 
use  of  random  effects.  Using  vector  notation,  we  can  write  (1.1)  in  the  form  of  the 
linear  model 

y-X/3  + e,  (1.3) 

where  y = (yx, . . . ,y4)'  is  a suitably  partitioned  response  vector  with  y*  containing 
the  responses  from  cluster  (family)  i,  (3  denotes  the  vector  of  fixed  effects,  X the 
corresponding  design  matrix  and  e ~ .A/^O,  crgl),  where  I is  the  identity  matrix. 

The  LMM  extends  (1.3)  to 

y = X/3  + Zu  + e,  (1.4) 

where  the  random  effect  u is  assumed  to  be  normally  distributed  with  mean  zero 
and  covariance  matrix  G,  independent  of  e,  and  Z denotes  the  design  matrix  for  u. 

Estimation  in  the  LMM  is  straightforward.  Notice  that  equation  (1.4)  implies 
that  marginally, 

y = X/3  + e*,  (1.5) 

where  e*  ~ M( 0,  V)  and  V = cTqI  + ZGZ'.  It  follows  from  equation  (1.5) 
that  the  marginal  likelihood  for  the  LMM  is  available  in  closed  form  and  thus 
estimates  for  (3,  <7%  and  G can  be  obtained  with  little  effort  (using,  e.g.,  the  EM  or 
Newton- Rap hson  algorithms  described  in  Chapter  3). 

The  LMM  also  allows  for  a flexible  correlation  structure  among  the  responses. 
For  example,  if  V is  of  block-diagonal  form,  V = BlockDiag(Vi, . . . , V4),  such 
that  Cov(yi,yi)  = V*  and  Cov(yj,yj)  = 0 (i  / j),  then  observations  within  the 
same  cluster  are  correlated  whereas  observations  between  different  clusters  are 
independent. 
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A simple  modification  of  model  (1.1)  accounts  for  the  clustering  of  the  data 
in  Table  1.1.  Declaring  the  main  family  effect  as  random  sets  up  a common 
correlation  among  all  observations  having  the  same  level  of  family.  Furthermore, 
declaring  the  family  x gender  interaction  effect  as  random  models  an  additional 
correlation  between  all  observations  that  have  the  same  level  of  both  family  and 
gender.  One  interpretation  of  this  effect  is  that  a female  in  a certain  family  exhibits 
more  correlation  with  the  other  female  in  that  family  than  with  other  males,  and 
likewise  for  males.  This  results  in  the  model 


Vij  = @0  + fij  + Ui  + Uij  + Cij,  (1-6) 

where  the  Ui  s and  u^’s  are  a random  sample  from  a standard  normal  distribution 
with  mean  zero  and  variances  of  and  erf,  respectively. 

Using  SAS’s  procedure  PROC  MIXED  to  fit  model  (1.6)  to  the  data,  the 
estimated  difference  of  the  gender  effects  becomes 

& - & = 3.3621  (StdErr=1.1923),  (1.7) 

resulting  in  a P- Value  = 0.0667,  which  is  quite  different  from  the  estimate  in  (1.2). 

1.3  Models  for  Non-Normal  Clustered  Data 
Linear  mixed  models  are  useful  tools  for  modeling  normal  clustered  data. 
However,  in  many  practical  situations  the  responses  are  not  normally  distributed, 
making  the  use  of  these  models  inappropriate.  Hierarchical  models  extend  the 
ideas  of  LMMs  to  more  general  types  of  data,  including  non-normal  responses.  One 
example  of  a hierarchical  model  that  currently  receives  a great  deal  of  attention 
in  the  literature  is  the  generalized  linear  mixed  model  (GLMM)  (Gilmour  et  al., 
1985;  Breslow  and  Clayton,  1993;  McCulloch,  1994,  1997).  The  GLMM  extends  the 
generalized  linear  model  (GLM)  (Nelder  and  Wedderburn,  1972;  McCullagh  and 
Nelder,  1989)  and  contains  both,  the  LMM  and  the  GLM,  as  a special  case. 
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The  GLM  is  a natural  extension  of  the  classical  linear  model  allowing  for  a 
larger  class  of  response  distributions.  Specification  of  a GLM  consists  of  three 
parts:  a random  component,  a systematic  component  and  a link  function.  The 
random  component  specifies  the  distribution  of  the  responses;  typically,  the  yi  s 
are  assumed  to  be  independent  with  a distribution  in  the  exponential  family  (e.g. 
normal,  binomial,  Poisson,  gamma,  etc.).  The  systematic  component  is  a linear 
function  of  the  covariates 

Vi  = x'/3,  (1-8) 

where  r?*  is  called  the  linear  predictor.  The  link  function  g(-)  is  a monotonic 
differentiable  function  which  relates  the  mean  response  to  the  linear  predictor 

Vi  ~ g(E[yi})-  (!-9) 

When  the  response  distribution  is  normal  and  the  link  is  the  identity  function,  the 
GLM  reduces  to  the  classical  linear  model. 

A limitation  of  GLMs  is  that  they  assume  independent  responses.  Just  as 
modifications  of  classical  linear  models  are  used  to  analyze  correlated  normal 
responses,  modifications  of  the  GLM  can  handle  non-normal  outcomes.  The 
GLMM  extends  the  GLM  to  correlated  responses  through  the  use  of  random 
effects.  In  a GLMM  one  includes  random  effects  additively  into  the  linear  predictor 
and  assumes  that,  conditional  on  the  random  effects,  the  response  follows  a GLM. 
Specification  of  a GLMM  is  completed  by  assuming  a marginal  distribution  for  the 
random  effects  (see  Section  2.1.1  for  a more  detailed  description  of  the  GLMM). 

Although  hierarchical  models  have  great  intuitive  appeal,  their  practical  use 
is  often  limited  by  computational  difficulties.  For  example  in  the  GLMM,  the 
assumption  of  normally  distributed  random  effects  typically  leads  to  an  analytically 
intractable  likelihood,  making  maximum  likelihood  estimation  complicated. 
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Table  1.2:  Weil  Data 


Treatment,  liter  number  and  litter  size  after  4 and  21  days 


c 

1 

13 

13 

c 

2 

12 

12 

c 

3 

9 

9 

c 

4 

9 

9 

c 

5 

8 

8 

c 

6 

8 

8 

c 

7 

13 

12 

c 

8 

12 

11 

c 

9 

10 

9 

c 

10 

10 

9 

c 

11 

9 

8 

c 

12 

13 

11 

c 

13 

5 

4 

c 

14 

7 

5 

c 

15 

10 

7 

c 

16 

10 

7 

t 

1 

12 

12 

t 

2 

11 

11 

t 

3 

10 

10 

t 

4 

9 

9 

t 

5 

11 

10 

t 

6 

10 

9 

t 

7 

10 

9 

t 

8 

9 

8 

t 

9 

9 

8 

t 

10 

5 

4 

t 

11 

9 

7 

t 

12 

7 

4 

t 

13 

10 

5 

t 

14 

6 

3 

t 

15 

10 

3 

t 

16 

7 

0 

Example.  Table  1.2  shows  data  from  Weil  (1970),  also  studied  by  McCulloch 
(1994).  In  this  experiment  16  pregnant  rats  received  a control  diet  (c)  and  16 
received  a chemically  treated  diet  (t).  The  size  for  the  resulting  32  litters  was 
recorded  after  4 and  21  days. 

In  order  to  model  the  survival  time,  y^k,  of  a newborn  rat  k in  litter  j 
receiving  treatment  i,  McCulloch  (1994)  assumed  a model  of  the  form 

Vijk  — fii  T Uij  + €ijhi  (1.10) 

where  (3 , is  the  treatment  mean,  Uij  a random  effect  due  to  litter  j receiving 
treatment  i,  and  residual  error.  He  assumed  further  that  the  s and  the 
tijk  s are  random  samples  from  a normal  distribution  with  mean  zero  and  variances 
o\  and  1,  respectively.  Notice  that  the  inclusion  of  the  random  effect  into  the 
model  allows  the  survival  times  of  two  newborn  rats  from  the  same  litter  to  be 
correlated  with  each  other. 

Notice,  however,  that  instead  of  observing  the  survival  times  yijk  directly,  the 
researcher  only  observes  whether  or  not  rat  k from  litter  j under  treatment  i dies 
between  day  4 and  21.  That  is,  he  only  observes  a Bernoulli  response,  say 
where 


Wi-jk  — 


0 if  Vijk  < 7 

1 if  yijk  > 7 


(1.11) 


8 

and  7 is  a threshold  value.  Without  loss  of  generality,  assume  7 = 0.  If  $(•) 
denotes  the  standard  normal  cdf;  then  equation  (1.10)  implies  that  conditional  on 

Uij  1 

7 Xij  = Prob (wijk  — 1| Uij)  — -I-  Uij).  (1.12) 

Notice  that,  conditional  on  u^,  the  model  in  (1.10)  and  (1.11)  defines  a GLM 
with  independent  Bernoulli  responses  w^k,  the  linear  predictor  r^jk  = /?»  4-  up  and 
the  function  g(-)  = <f>-1(-)  which  links  the  linear  predictor  to  the  (conditional) 
mean,  E[u)ijk\uij]  — 7ty.  The  assumption  of  normally  distributed  random  effects, 
u^,  completes  the  specification  of  the  GLMM. 

Maximum  likelihood  estimation  for  this  model,  however,  is  complicated  by 
the  fact  that  the  likelihood  function  is  analytically  intractable.  If  a?y  = w^k 
and  if  denotes  the  litter  size  after  day  4,  then,  conditional  on  , Xij\uij  ~ 
Binomial(7Tjj,  m,j).  Let  w denote  the  vector  of  observed  responses  and  u the  vector 
of  random  effects.  The  (marginal)  likelihood  of  the  observed  data  is  obtained  by 
integrating  out  the  random  effects  in  the  joint  likelihood  of  w and  u, 

L(/3.fJi|w)  oc  Y[  f 7rJ°  (l  - 7Ty )m‘J' ~Xii <p(uij/ a 1 )duij , (1.13) 

*d 

where  tp(-)  denotes  the  standard  normal  pdf.  The  maximum  likelihood  estimates 
of  (3  and  a\  are  obtained  by  maximizing  L{^,o\\w).  However,  estimation  is  not 
straightforward,  since  the  integral  in  (1.13)  does  not  have  a closed  form  solution. 

1.4  Maximum  Likelihood  Computation  in  Hierarchical  Models 
Since  maximum  likelihood  estimation  in  hierarchical  models  is  complicated  by 
the  fact  that  the  likelihood  function  typically  involves  an  analytically  intractable 
integral,  much  of  the  work  in  this  area  has  focused  on  computational  aspects. 

In  the  following  we  review  different  approaches  in  the  literature  for  maximum 
likelihood  computation. 
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In  the  early  literature,  Williams  (1982)  proposed  an  iterative  scheme  based 
on  a quasi-likelihood  approach  for  estimating  the  parameters  in  a GLMM.  His 
approach,  however,  only  worked  well  when  the  variability  of  the  random  effects  was 
small. 

Anderson  and  Aitkin  (1985)  suggested  the  use  of  numerical  integration  to 
approximate  the  intractable  integral.  In  particular,  they  suggested  using  Gauss- 
Hermite  quadrature  within  the  context  of  an  EM  algorithm  to  approximate  the 
mode  of  the  likelihood.  Liu  and  Pierce  (1994)  and  Pinheiro  and  Bates  (1995) 
suggested  using  quadrature  methods  to  approximate  the  likelihood  directly. 
However,  a major  disadvantage  of  numerical  integration  is  that  the  error  of 
approximation  increases  with  the  dimension  of  the  integral  (e.g.  Evans  and  Swartz, 
1995).  It  is  therefore  not  recommended  for  multivariate  random  effects  with  a 
complex  correlation  structure.  Moreover,  establishing  exactly  how  the  error  (in 
the  approximation  of  the  integral)  propagates  through  the  maximization  can  be 
difficult  (Crouch  and  Spiegelman,  1990). 

Several  authors  proposed  to  approximate  the  intractable  likelihood  function 
analytically.  The  model  fitting  algorithms  of  Breslow  and  Clayton  (1993)  and 
Wolfinger  and  O’Connell  (1993)  are  essentially  derived  from  a Laplace  approxima- 
tion to  the  intractable  likelihood  integral.  Their  algorithms  involve  iterative  fitting 
of  LMMs  to  the  data  and  work  well  for  observations  that  can  be  approximated 
well  by  a normal  distribution  (e.g.,  Binomial  data  with  large  cell  counts).  However, 
recent  research  has  shown  that  these  pseudo-likelihood  estimates  are  substantially 
biased  in  some  circumstances  (Kuk,  1995;  Breslow  and  Lin,  1995;  Lin  and  Breslow, 
1996;  Jiang,  1998). 

There  are  several  approaches  in  the  literature  to  choose  the  random  effects 
distribution  such  that  the  likelihood  is  available  in  closed  form.  A number  of  such 
models  (e.g.,  the  beta-binomial  or  the  Poisson-gamma  model)  are  found  in  the 


10 


literature  on  over-dispersion  (Stein,  1988;  Conaway,  1990;  Hinde  and  Dernetrio, 
1998).  A common  criticism  of  this  approach  is  that  distributions  chosen  in  this  way 
do  not  allow  for  modeling  complex  interdependence  between  multivariate  random 
effects. 

The  choice  of  the  random  effects  distribution  can  influence  the  parameter 
estimates  of  the  fixed  effects  (see  Heckman  and  Singer,  1984;  Neuhaus  et  al.,  1992). 
This  has  led  to  recent  work  focusing  on  non-parametric  approaches  for  fitting 
hierarchical  models  (Follmann  and  Lambert,  1989;  Aitkin,  1996;  Friedl,  1998; 

Aitkin,  1999).  Such  approaches  assume  that  the  random  effects  follow  a discrete 
distribution  with  unknown  masses  and  mass  points.  However,  Aitkin  (1999)  and 
Heckman  and  Singer  (1984)  noted  that  this  approach  is  primarily  useful  when  the 
random  effects  distribution  is  a nuisance  rather  than  of  direct  interest,  since  the 
non-parametric  estimate  of  that  distribution  may  be  poor  even  for  large  samples. 

The  focus  of  this  dissertation  is  on  stochastic  estimation  methods.  Stochastic 
estimation  methods  are  computer  intensive  techniques  that  have  become  popular 
with  the  development  of  powerful  computing  facilities.  Stochastic  estimation 
methods  use  simulation  to  approximate,  for  example,  intractable  integrals.  In  the 
field  of  statistics,  integrals  can  often  be  represented  as  an  expectation  of  the  form 

E[h(U)\  = I h(u)f(u)du,  (1.14) 

for  a random  variable  U with  density  /(•).  If  one  can  generate  (or  simulate)  a 
sample  U\, . . . , uM  from  /,  then  the  integral  in  (1.14)  can  be  approximated  by  the 
empirical  average 

1 M 

J7  (L15) 

fc=i 

This  approach  is  often  referred  to  as  Monte  Carlo  integration.  An  excellent 
introduction  into  Monte  Carlo  methods  can  be  found  in  Robert  and  Casella  (1999). 
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Several  authors  have  suggested  simulation  to  approximate  the  value  of  the 
likelihood  in  hierarchical  models  (Geyer  and  Thompson,  1992;  Gelfand  and  Carlin, 
1993;  McCulloch,  1997).  They  propose  to  estimate  the  intractable  likelihood 
integral  by  an  empirical  average,  similar  to  (1.15),  and  obtain  the  parameter 
estimate  by  numerically  maximizing  this  estimate.  This  approach  can  often  be 
found  in  the  literature  under  the  name  of  simulated  maximum  likelihood  (SML). 
SML  is  especially  popular  in  economics  where  much  of  the  work  on  this  topic  has 
been  done  (Lee,  1992,  1995,  1998,  1999;  Gourieroux  and  Monfort,  1993;  Moon  and 
Stotsky,  1993;  Danielsson,  1994;  Crepon  and  Duguet,  1997;  Liesenfeld,  1998). 

An  alternative  to  approximating  the  likelihood  function  directly  is  to  compute 
only  its  maximum.  The  EM  algorithm  (Dempster  et  ah,  1977;  Wu,  1983)  iterates 
between  expectation  and  maximization  steps  and  produces  a sequence  of  estimates 
that  converges  to  a (local)  maximum  of  the  likelihood.  In  hierarchical  models, 
however,  the  expectation  step  is  typically  not  available  in  closed  form.  This  has  led 
to  the  development  of  the  Monte  Carlo  EM  (MCEM)  algorithm  (Wei  and  Tanner, 
1990;  McCulloch,  1994,  1997;  Chan  and  Ledolter,  1995;  Booth  and  Hobert,  1999; 
Natarajan  et  ah,  2000),  in  which  the  intractable  expectation  is  replaced  by  a Monte 
Carlo  approximation.  One  difficulty  associated  with  MCEM  is  that  it  does  not 
converge  unless  one  increases  the  Monte  Carlo  sample  size,  M,  as  the  algorithm 
progresses. 

Another  popular  method  for  computing  the  maximum  of  a function  is  the 
Newton-Raphson  algorithm.  Given  a current  parameter  estimate,  Newton-Raphson 
uses  the  score  function  and  the  Hessian  matrix  to  compute  the  next  parameter 
update.  However,  since  the  score  function  is  defined  as  the  first  derivative  of  the 
log-likelihood,  in  hierarchical  models  Newton-Raphson  typically  involves  evaluation 
of  an  intractable  integral.  The  Monte  Carlo  Newton-Raphson  (MCNR)  algorithm 
(Kuk  and  Cheng,  1997;  McCulloch,  1997;  Gauderman  and  Navidi,  2001)  replaces 
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this  integral  by  a Monte  Carlo  approximation.  Similar  to  MCEM,  MCNR  requires 
M to  be  increased  successively  for  convergence. 

A question  that  typically  remains  unanswered  for  both  algorithms,  MCEM 
and  MCNR,  is  how  M should  be  increased.  Wei  and  Tanner  (1990)  noted  that 
it  is  inefficient  to  start  with  a large  Monte  Carlo  sample  size  when  the  current 
parameter  update  is  far  from  the  maximizer  of  the  likelihood.  They  went  on  to 
suggest  that  M should  be  increased  with  the  number  of  iterations  but  did  not 
say  exactly  how  this  should  be  done.  Many  authors  (including  Wei  and  Tanner) 
suggest  that  the  decision  to  increase  M should  be  made  after  inspecting  a plot  of 
the  parameter  updates  versus  the  iteration  number.  Clearly,  this  decision  becomes 
more  difficult  when  the  dimension  of  the  parameter  increases.  Moreover,  it  does 
not  provide  a way  for  automatically  increasing  M.  Booth  and  Hobert  (1999)  are 
the  first  to  develop  an  automated  MCEM  algorithm.  They  suggested  to  increase 
the  Monte  Carlo  sample  size  when  the  current  parameter  update  is  swamped 
by  Monte  Carlo  error.  Their  argument  for  increasing  M,  however,  is  valid  only 
for  independent  samples,  ui, , uM-  In  situations  in  which  it  is  not  possible  to 
generate  independent  samples,  the  question  about  the  optimal  choice  of  M still 
remains. 

On  the  other  hand,  there  exist  versions  of  the  above  two  algorithms  that 
converge  with  constant  M.  These  versions  are  based  on  the  ideas  of  stochastic 
approximation  (Robbins  and  Monro,  1951;  Kiefer  and  Wolfowitz,  1952).  In 
particular,  Delyon  et  al.  (1999)  consider  a stochastic  approximation  version  of  the 
EM  (SAEM)  algorithm.  Similarly,  Ruppert  (1985)  and  Gu  and  Li  (1998)  discuss 
stochastic  approximation  versions  of  Newton-Raphson  (SANR).  The  appeal  of  these 
versions  is  that  they  do  not  require  any  additional  effort  to  decide  whether  and 
when  to  increase  M,  which,  at  first  glance,  appears  to  be  an  advantage  over  MCEM 
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or  MCNR.  There  are,  however,  disadvantages  associated  with  SAEM  and  SANR,  as 
we  will  explain  later  in  this  work. 

Most  of  the  literature  on  stochastic  estimation  methods,  with  the  exception  of 
McCulloch  (1997),  focuses  on  one  individual  method  only,  making  no  attempt  to 
compare  competing  approaches.  McCulloch  (1997,  p.165)  compares  SML,  MCEM 
and  MCNR  empirically.  He  finds  in  simulations  that  “SML  performs  poorly  [...]. 
SML  is  much  slower  than  either  MCEM  or  MCNR,  but  it  converges  to  a value 
much  further  from  the  MLE.”  McCulloch’s  investigations  are  certainly  a first  step 
towards  determining  the  most  suitable  estimation  method  for  hierarchical  models. 
However,  his  findings  are  of  empirical  nature  only.  Moreover,  they  do  not  include 
SAEM  and  SANR,  making  broader  and  more  theoretical  investigations  desirable. 

1.5  Outline 

The  purpose  of  this  dissertation  is  to  compare  the  performance  of  stochastic 
estimation  methods.  We  conduct  empirical  as  well  as  analytical  investigations 
to  compare  the  efficiency  of  SML,  MCEM,  MCNR,  SAEM  and  SANR.  This 
dissertation  is  organized  as  follows. 

In  Chapter  2 we  describe  the  general  hierarchical  model  and  its  wide  range  of 
applicability.  We  describe  some  important  special  cases  of  this  model  and  illustrate 
them  on  several  examples.  Then  we  discuss  maximum  likelihood  estimation  and 
problems  associated  with  it.  We  present  the  marginal  likelihood  function  and  argue 
that  it  typically  involves  an  analytically  intractable  integral. 

Chapter  3 concentrates  on  deterministic  methods  to  compute  the  intractable 
likelihood.  First  we  introduce  the  ideas  behind  numerical  integration  and  analytical 
approximation  to  the  intractable  integral  and  point  out  limitations  to  these  two 
approaches.  Then  we  discuss  Newton-Raphson  and  the  EM  algorithm.  We  outline 
the  ideas  of  the  two  algorithms  and  illustrate  them  on  several  examples.  We 
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compare  the  two  algorithms  analytically  and  conclude  this  chapter  by  describing 
important  modifications  of  EM. 

In  Chapter  4 we  consider  stochastic  methods  to  compute  the  intractable 
likelihood.  First  we  introduce  the  general  ideas  of  Monte  Carlo  integration.  Then 
we  describe  SML  and  point  out  difficulties  associated  with  it.  We  proceed  by 
discussing  MCEM  and  MCNR.  We  illustrate  both  algorithms  on  several  examples, 
discuss  convergence  and  the  choice  of  the  starting  values.  Then  we  introduce  the 
general  ideas  of  stochastic  approximation  and  describe  SAEM  and  SANR.  We  use 
several  examples  to  illustrate  both  algorithms  and  discuss  the  difference  to  their 
Monte  Carlo  versions.  We  conclude  by  addressing  convergence  rates  and  stopping 
rules  for  stochastic  approximation  algorithms. 

Chapter  5 concentrates  on  the  comparison  of  MCEM  and  SML.  We  derive 
the  asymptotic  Monte  Carlo  standard  errors  of  MCEM  and  SML  analytically  and 
use  this  result  to  investigate  the  efficiency  of  these  two  methods.  We  conduct  a 
simulation  study  and  illustrate  our  findings  on  several  examples. 

Chapter  6 focuses  on  efficiency  improvement  using  Quasi-Monte  Carlo.  First 
we  describe  the  ideas  of  Quasi-Monte  Carlo  and  its  difference  to  classical  Monte 
Carlo.  Then  we  demonstrate  how  to  apply  Quasi-Monte  Carlo  to  hierarchical 
models  and  conduct  a simulation  study  to  investigate  the  efficiency  improvement  of 
SML  using  Quasi-Monte  Carlo  over  classical  Monte  Carlo. 

Chapter  7 concentrates  on  the  comparison  of  MCEM  and  SAEM.  We  charac- 
terize the  convergence  rate  of  MCEM  and  SAEM  in  the  univariate  as  well  as  in  the 
multivariate  parameter  case.  We  use  this  result  to  investigate  the  efficiency  of  these 
two  methods.  We  conduct  a simulation  study  and  use  several  examples  to  illustrate 
our  results. 

The  dissertation  concludes  with  a summary  of  the  most  important  findings  and 
a discussion  of  future  research  topics  (Chapter  8). 


CHAPTER  2 

MODEL  AND  LIKELIHOOD 


2.1  The  General  Hierarchical  Model  (GHM) 

Simple  statistical  models  often  assume  independence  among  the  observations. 
For  example,  logistic  regression,  which  is  useful  for  modeling  binomial  data, 
requires  the  responses  to  be  independent  of  each  other.  In  many  applications, 
however,  the  dependence  structure  of  the  data  is  more  complex.  In  particular, 
observations  are  often  clustered,  with  observations  within  clusters  being  correlated. 
In  the  social  sciences,  for  example,  the  performance  of  students  within  the  same 
school  or  school  district  tend  to  be  positively  correlated.  A medical  researcher  will 
typically  find  that  the  responses  of  patients  to  a treatment  are  similar  within  the 
same  treatment-center.  In  economics,  unemployment  rates  tend  to  be  correlated 
within  the  same  region.  More  generally,  if  repeated  observations  are  taken  within  a 
sampling  unit  then  they  are  typically  (positively)  correlated. 

Hierarchical  models  are  popular  tools  that  enable  the  researcher  to  account 
for  correlation  within  a set  of  observations.  A recent  introduction  to  hierarchical 
modeling  from  a statistical  point  of  view  can  be  found  in  Hobert  (2000).  Following 
his  review,  our  starting  point  will  be  the  general  two-stage  general  hierarchical 
model  (GHM). 

Let  y = (j/i, . . . , yn)  be  an  n- vector  of  responses  and  let  u = (mx,  . . . , uq)  be 
a g-vector  of  unobserved  random  effects  or  missing  data.  Let  i/>  = (-*/? ^ , i/j'2)'  be  a 
5-vector  of  unknown  parameters.  At  the  first  stage  of  the  hierarchy  the  conditional 
density  of  y given  u is  assumed  to  be  of  a specific  parametric  form,  /(y|u;  •01). 
Then,  at  the  second  stage  of  the  hierarchy,  the  marginal  density  of  u,  denoted  by 
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<?(u;  ip 2),  is  specified.  Let  /( y,  u;  ip)  = /(y|u;  ipi)g( u;  ip2)  be  the  joint  density  of 
the  observed  and  unobserved  data  and  let  f(y,ip)  denote  the  marginal  density  of 
the  responses. 

In  the  following  we  discuss  several  special  cases  of  the  GHM.  These  are  the 
following:  the  one-way  balanced  mixed  model  (OWMM),  the  linear  mixed  model 
(LMM)  and  the  generalized  linear  mixed  model  (GLMM).  Using  the  notation  “A 
C B”  to  express  that  model  A is  a special  case  of  model  B,  the  ordering  of  these 
models  is 

OWMM  C LMM  C GLMM  C GHM. 


2.1.1  The  Generalized  Linear  Mixed  model  (GLMM) 

A special  case  of  the  GHM  which  currently  receives  a great  deal  of  attention 
is  the  GLMM  (Gilmour  et  al.,  1985;  Breslow  and  Clayton,  1993;  McCulloch,  1994, 
1997).  The  GLMM  extends  the  generalized  linear  model  (GLM)  by  allowing  for 
additional  components  of  variability  due  to  unobservable  random  effects. 

Specifically,  following  Booth  and  Hobert  (1999),  we  let  x;  and  z ; be  p-  and 
^-vectors  of  covariates  associated  with  the  fth  response  t/i,  i — 1 , . . . ,n.  Let  X 
and  Z be  the  corresponding  ( n x p)  and  (n  x q)  model  matrices.  We  assume 
that,  conditional  on  u,  the  data  y arise  from  a GLM  with  linear  predictors, 

Pi  — x'/3  + z'u,  where  f3  is  a p-vector  of  unknown  regression  coefficients.  Writing 
Pi  = E(yi |u)  for  the  conditional  mean  of  the  ith  response,  we  assume  that  pi  and 
the  linear  predictor  rp  satisfy  g(pi)  = rp,  where  the  link  function  g may  be  any 
monotonic  differentiable  function.  The  responses  are  assumed  to  be  conditionally 
independent,  stemming  from  an  exponential  family 


/(2/i|u;/3,0o)  = exP 


ViOi  ~ b(Qj) 


+ c(pi,cr2) 


(2.1) 


The  mean  and  the  canonical  parameter  are  related  through  the  equation  pi  — b'(9i) 
(see  McCullagh  and  Nelder,  1989,  Chapter  2).  The  likelihood  function  for  the 
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conditional  generalized  linear  model,  in  which  the  effects  vector  is  treated  as  fixed 
but  unknown  parameter,  is  given  by 

n 

/(y|u;  P,  cr2o)  = fj  /(y<|u,  p,  a$). 

i= 1 

Specification  of  the  model  is  completed  by  assuming  that  u is  a g-variate  random 
variable  with  a parametric  density,  g(u;  V>2). 

In  a GLMM  one  typically  assumes  that  u is  a mean-zero  multivariate  normal 
random  variable  with  covariance  matrix  G = G(^2)i  where  ^2  denotes  a vector  of 
variance  components.  (To  emphasize  that  \J>2  is  a vector  of  variance  components 
many  authors  write  cr\  in  place  of  ip2.  We  will  adopt  this  notation  in  the  following 
chapters).  However,  other  distributions  for  u are  possible.  In  particular,  Lee 
and  Nelder  (1996)  consider  distributions  conjugate  to  that  of  the  response.  This 
approach  leads  to  the  hierarchical  generalized  linear  model  (HGLM). 

The  following  examples  are  special  cases  of  the  GLMM  and  HGLM.  First, 
we  present  an  example  of  a GLMM  that  is  often  found  useful  to  model  clustered 
binary  data. 

Logistic-Normal  Model.  Hierarchical  models  are  popular  tools  for  educational 
data  (see,  e.g.,  Woodhouse  et  ah,  1996;  Raudenbush  and  Bryk,  1986).  For  example, 
suppose  that  in  a study  of  factors  that  affect  school  performance,  data  are  collected 
from  m randomly  selected  schools.  The  m schools  might  be  a random  sample  from 
the  collection  of  all  schools  in  the  county,  region  or  school  district.  Within  each 
school,  r students  are  randomly  selected  and  their  performance  (S=successful  or 
U=unsuccessful)  over  the  last  school  year  is  recorded  (see  Figure  2.1).  Thus  there 
is  a binary  response,  ytj,  for  student  j (j  — 1, . . . , r)  in  school  i ( i = 1, . . . , m), 
where  = 1 or  0 depending  on  whether  the  student  is  successful  or  unsuccessful, 
respectively.  Suppose  further  that  there  is  a vector  of  explanatory  variables 
(such  as  socioeconomic  factors)  associated  with  each  student.  If  the  researcher  is 
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Figure  2.1:  Student  school  performance 


interested  in  modeling  the  probability  of  a successful  student  performance,  logistic 
regression  is  a natural  model  choice.  However,  observations  within  the  same  school 
are  often  correlated  due  to  unobserved  effects  like,  for  example,  school  equipment, 
the  teaching  method  or  the  teacher  performance. 

The  logistic-normal  model  provides  a flexible  way  of  accounting  for  correlation 
between  subjects  within  the  same  group.  It  models  the  logit  of  the  probability  of 
success  for  subject  j in  group  i as 

logit  (7Tjj)  = x'ij/3  + Ui , (2.2) 

where  logit^)  = log(7r^/(l  — 7 r^))  and  7rt- j = Prob = l|iq).  Conditional  on 
Ui,  the  responses  are  assumed  to  be  independent  Bernoulli^).  Specification  of  the 
logistic-normal  model  is  completed  by  assuming  that  Ui  ~ N(0,al).  The  inclusion 
of  a random  school  effect,  Ui,  induces  a positive  correlation  between  the  responses 
for  the  students  in  the  same  school. 

Clearly,  the  logistic-normal  model  is  a special  case  of  the  GLMM.  It  uses  the 
logit-link  function  to  model  Bernoulli  responses  and  assumes  normally  distributed 
random  effects.  Let  us  now  focus  our  attention  on  count  data.  The  modeling 
of  count  data  is  often  complicated  by  the  presence  of  over-dispersion.  In  the 
next  paragraph  we  present  an  example  of  a GLMM  that  is  useful  to  model  over- 
dispersed count  data. 

Poisson-Normal  Model.  The  following  example  is  taken  from  Agresti  et  al. 
(2001).  They  consider  data  from  the  1990  General  Social  Survey,  where  one  of 
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the  questions  asked  was  “Within  the  past  12  months,  how  many  people  have  you 
known  personally  that  were  victims  of  homicide?”  One  natural  model  for  count 
data  of  this  kind  is  a Poisson  regression  model.  A severe  limitation  of  the  Poisson 
model  is  that  the  variance  must  be  identical  to  the  mean.  However,  count  data 
often  show  over-dispersion,  with  the  variance  exceeding  the  mean.  Indeed,  Agresti 
et  al.  (2001)  reported  that  the  mean  response  for  blacks  was  .522  with  a variance 
of  1.150;  for  whites  the  mean  was  .092  with  a variance  of  .155.  The  ratio  of  the 
variance  to  the  mean  for  each  race  provides  evidence  of  over-dispersed  data  for  the 
Poisson  model. 

Generalized  linear  mixed  models  provide  a flexible  way  of  accounting  for 
over-dispersion  of  count  data  with  respect  to  the  Poisson  distribution.  Specifically, 
assume  that  the  distribution  of  the  data  is  Poisson;  that  is,  one  models  the  log  of 
the  mean  response  of  subject  j in  (ethnicity-)  group  i as 

log  {fJ-ij)  = ^(3 + Uij,  (2.3) 

where,  conditionally  on  rq,  response  is  assumed  to  stem  from  a Poisson  (//re- 
distribution. Assuming  that  iq  ~ N(0,  of)  leads  to  the  Poisson-normal  model. 

The  inclusion  of  the  random  effect,  tq,  typically  accounts  for  some  of  the  extra 
variability  in  the  data. 

The  previous  two  examples  were  special  cases  of  the  GLMM  and  focused  on 
normally  distributed  random  effects.  However,  other  distributional  assumptions  are 
possible.  In  particular,  changing  the  random  effects  distribution  in  the  previous 
example  from  normal  to  one  which  is  conjugate  to  the  Poisson  distribution  leads  to 
an  example  of  a HGLM. 

Poisson-Gamma  Model.  Consider  the  previous  example.  If  we  assume  that 
the  random  effects,  u^,  vary  according  to  a gamma  distribution,  we  obtain  the 
Poisson-gamma  model.  Since  the  gamma  distribution  is  conjugate  to  the  Poisson 
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Table  2.1:  Pregnancy  rates  for  women  under  18  in  13  North  Central  Florida  coun- 
ties over  the  three-year  period  1989-1991.  (Gainesville  Sun,  April  30,  1994) 

275  50  110  104  21  8 41  7 30  243  129  38  22 

8544  1032  4851  2064  480  399  513  198  1050  8259  2946  1053  405 

Is*  row:  number  of  births  to  women  under  18  (?/;) 

2nd  row:  total  number  of  births  ( n* ) 

distribution  (and  since  we  assume  the  log-link  in  (2.3)  for  the  mean),  the  likelihood 
function  (discussed  later  in  Section  2.2)  is  analytically  tractable.  However,  a 
criticism  of  this  approach  is  that  the  gamma  distribution  does  not  allow  for 
modeling  complex  interdependence  between  multivariate  random  effects,  which  is 
often  a severe  limitation  to  HGLMs. 

Beta-Binomial  Model.  The  data  in  Table  2.1  concerns  pregnancy  rates  for 
girls  under  18  in  13  North  Central  Florida  counties.  Booth  and  Caffo  (2001) 
report  that  the  empirical  variability  in  the  child  pregnancy  rates  among  the 
counties  is  far  greater  than  the  binomial  sampling  variability  with  a common 
underlying  rate  (deviance  = 89.86  with  df=12).  One  approach  to  account  for 
the  extra  variability  in  the  data  is  to  use  the  logistic-normal  model  described 
earlier.  However,  instead  of  assuming  a normal  random  effects  distribution  one  can 
alternatively  use  a beta  distribution.  Notice  that  the  beta  distribution  is  conjugate 
to  the  binomial  distribution.  This  approach  leads  to  the  beta-binomial  model  in 
which  the  child  pregnancy  rates  vary  randomly  among  the  counties  according  to  a 
beta  distribution;  that  is,  conditional  on  Ui,  i = 1, . . . , 13,  one  assumes  that 

yi\ui  ~ Binomial(n,,  Ui ),  (2.4) 


where  Ui  ~ Beta(o:, /3). 
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2.1.2  The  Linear  Mixed  Model  (LMM) 

The  linear  mixed  model  is  a special  case  of  the  GLMM,  which  uses  the  identity 
link  function  and  assumes  normally  distributed  responses.  In  this  case  the  response 
yi  is  a linear  combination  of  fixed  and  random  effects  and  an  error  term  e*, 

Mi  = xi/3  + Ku  + ei- 

The  vector  of  errors,  e = (ei, . . . , en)',  follows  a multivariate  normal  distribution, 
A/”(0,  W-1),  with  W = (1/<Tq)I,  and  is  uncorrelated  with  the  vector  of  random 
effects,  u ~ Af{0,  G). 

The  LMM  has  been  discussed  extensively  in  the  literature  (see,  e.g.,  Searle 
et  ah,  1992).  It  is  an  important  special  case  of  the  GHM,  since  many  techniques, 
developed  for  LMMs,  have  carried  over  to  the  theory  of  GLMMs  and  GHMs.  For 
example,  the  model-fitting  algorithms  of  Breslow  and  Clayton  (1993)  and  Wolfinger 
and  O’Connell  (1993)  developed  for  GLMMs  (and  discussed  later  in  Chapter  3) 
involve  iterative  fitting  of  LMMs. 

2.1.3  The  One-Way  Balanced  Mixed  Model  (OWMM) 

The  one-way  balanced  mixed  model  is  one  of  the  simplest  mixed  models  and 
is  a special  case  of  the  LMM.  Assume  that  a continuous  response,  y^,  has  been 
observed  on  subject  j in  group  i,  where  i = 1, . . . , m and  j = 1, . . . , r.  We  model  y 
as 

Vij  — y T Ui  T Cij , (2-5) 

where  y is  the  overall  mean,  Ui  the  effect  of  group  i,  and  the  error  associated 
with  the  jth  response  in  group  i.  The  Ui  s and  e^’s  are  assumed  to  be  independent 
random  samples  from  A^(0,cr^)  and  N(0,(Tq),  respectively.  Later  in  this  work  we 
consider  the  case  where  the  variance  components,  Uq  and  are  known  and  we 
are  interested  in  estimating  only  y.  Although  this  is  an  unrealistic  assumption  in 
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practice,  it  is  useful  for  illustrating  methods  and  concepts  discussed  later  in  this 
work. 


2.2  Maximum  Likelihood  Estimation 
Technical  issues  for  fitting  GHMs  receive  considerable  attention,  as  they 
present  many  difficulties.  Maximum  likelihood  (ML)  estimates  of  the  parameter 
xj)  are  obtained  using  the  marginal  likelihood  in  which  the  unobservable,  random 
effects  are  integrated  out 

L(^\y)  = j f{y,  u;  xjj)du  = / (y;  iff).  (2.6) 

In  what  follows  we  write  Z(t/>)  = logL(i/>|y)  for  the  log  of  the  likelihood  function, 
suppressing  the  dependence  on  the  data  y.  The  maximum  likelihood  estimator 
(MLE),  ifi,  is  usually  obtained  as  the  solution  to  the  scoring  equations  S (^)  = 0, 
where  the  score  function  S is  defined  as  the  gradient  of  the  log-likelihood,  that  is, 

(2.7) 

In  GHMs  the  likelihood  typically  does  not  have  a closed  form  expression. 
Except  for  a few  special  cases,  the  integral  in  (2.6)  is  analytically  intractable. 
Consider  the  following  example  for  illustration. 


2.2.1  Logistic-Normal  Model 

The  likelihood  for  the  logistic-normal  model  in  Section  2.1.1  is 
L(0, al\y)  = ft  (/ n e-PiV,Ml13  + U,))  v(u‘M 


duj 


(2.8) 


f=l  V j= u f1  + exP(xij^  + «i)]  ffl 
where  ip(x)  denotes  the  pdf  of  a standard  normal  random  variable.  The  right  hand 
side  of  equation  (2.8)  contains  the  product  of  m one-dimensional  integrals.  Notice, 
however,  that  none  of  these  integrals  has  a closed  form  solution  and  therefore 
approximate  methods  must  be  used  to  evaluate  the  likelihood  function. 
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An  additional  difficulty  associated  with  GHMs  is  that  the  intractable  integral 
in  (2.6)  often  is  of  very  high  dimension.  In  the  examples  considered  so  far  we  have, 
for  simplicity  of  exposition,  assumed  that  the  random  effects  Uj  are  univariate 
and  uncorrelated.  In  practice,  this  is  often  not  the  case.  Consider  the  following 
modification  of  the  logistic-normal  example  from  Section  2.1.1. 

2.2.2  Logistic-Normal  Model  with  Multivariate  Random  Effects 

A simple  modification  of  the  logistic-normal  model  from  Section  2.1.1  is 
achieved  by  supposing  that  data  is  available  for  T different  years.  Thus,  we  observe 
binary  responses,  yhij,  where  yhij  is  the  performance  of  student  j,  ( j = 1, . . . ,r), 
in  school  i,  ( i = 1, . . . , m),  from  school-year  h,  ( h = 1, . . . , T).  We  still  want  to 
account  for  an  unobservable  effect  due  to  school  i.  However,  the  effect  of  school 
i may  change  over  the  course  of  T years.  Let  uhi  denote  the  effect  from  school  i 
for  year  h,  and  let  u*  = (uu, uTi)' . We  assume  that  u*  ~ 7V(0,  G<).  This 
approach  allows  effects  from  different  years  (for  the  same  school)  to  be  correlated. 
The  (modified)  logistic-normal  model  is 


logit  (jthij ) = XfcyjS  + Ufci. 


(2.9) 


Then,  the  likelihood  function  from  equation  (2.8)  changes  to 


exp (y/lij (x^  p + Uhi ))  exp{-|u'Gl  V} 

77- , , n . 77 ,7  i.  , wo du, 


[1  + exp  (x'Wi/3  + uM)]  |2t rGil-1/2 


(2.10) 


a product  of  m T-dimensional  (intractable)  integrals. 


CHAPTER  3 

DETERMINISTIC  MAXIMUM  LIKELIHOOD  COMPUTATION 


In  this  chapter  we  introduce  several  deterministic  approaches  for  dealing  with 


the  intractable  likelihood  function  in  (2.6).  Section  3.1  describes  an  analytical 
approach  based  on  a Laplace  approximation  to  the  intractable  integral.  Section 
3.2  introduces  the  general  ideas  of  quadrature  methods.  The  drawback  to  these 
first  two  methods  is  that  they  may  not  perform  well  in  GHMs.  Laplace  approxi- 
mations can  produce  inconsistent  parameter  estimates  and  quadrature  methods  are 
generally  not  recommended  when  the  dimension  of  the  integral  is  large. 

In  Sections  3.3  and  3.4  we  proceed  by  introducing  two  iterative  methods,  the 
Newton-Raphson  and  the  EM  algorithm.  We  explain  the  general  ideas  of  these  two 
algorithms  and  illustrate  them  on  several  examples.  In  Section  3.5  we  highlight  the 
similarity  between  the  two  algorithms  and  discuss  several  important  modifications 
of  the  EM  algorithm  in  Section  3.6. 


One  approach  for  dealing  with  the  intractable  integral  in  (2.6)  is  to  use  an 
analytical  approximation.  The  penalized  quasi-likelihood  (PQL)  methods  of 
Breslow  and  Clayton  (1993)  and  Wolfinger  and  O’Connell  (1993)  are  essentially 
derived  from  a Laplace  approximation  to  the  integral  in  (2.6).  The  basic  idea  of  the 
Laplace  approximation  (see,  e.g.  De  Bruijn,  1958)  is  as  follows.  Suppose  we  want  to 
approximate  the  integral 


3.1  Penalized  Quasi-Likelihood 


(3.1) 
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which  is  the  form  of  the  GHM  likelihood  for  distributions  in  the  exponential 
family.  Let  x be  the  mode  of  ^(-),  satisfying  dg(x.)  / dx\x=%  = 0,  and  define 
£ = — (d2g(x)/dxdx'\x=x)~1.  A second  order  Taylor  expansion  of  the  integrand  in 
(3.1)  about  x yields 

J exp {n(7(x)}dx«  J exp  jn#(x)  — ^(x  — x)£-1(x  — x)'|  dx.  (3.2) 

But  the  rightmost  term  in  equation  (3.2)  can  be  identified  as  the  kernel  of  a normal 
random  variable  with  mean  x and  covariance  matrix  E/n.  It  follows  that 

J exp{n^(x)}dx  « \2TrT,/n\1^  exp{ng'(x)}.  (3.3) 

Equation  (3.3)  is  the  basic  form  of  the  Laplace  approximation. 

However,  PQL  is  now  known  to  produce  inconsistent  parameter  estimates  and 
the  approximation  can  be  very  poor  when  the  random  effects  variance  components 
are  not  small  (Kuk,  1995;  Breslow  and  Lin,  1995;  Lin  and  Breslow,  1996;  Jiang, 
1998). 


3.2  Quadrature 

Another  common  approach  for  handling  the  intractable  integral  in  (2.6) 
is  numerical  integration  via  Gauss-Hermite  quadrature.  Quadrature  methods 
approximate  integrals  by  sums  based  on  weights  and  nodes  computed  from  the 
integrand.  The  fundamental  idea  of  Gauss-Hermite  Quadrature  is  as  follows.  In 
order  to  approximate  an  integral  of  the  form 

J f(x)e~x2dx,  (3.4) 

which  is  the  form  of  the  GHM-likelihood  in  the  case  of  normally  distributed 
random  effects,  one  has  to  find  the  roots  xk,(k  = 1 of  the  Kt  h Hermite 

polynomial,  HK , and  corresponding  weights,  wk.  These  are  tabulated  in,  for 
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example,  Abramowitz  and  Stegun  (1992).  Then,  one  approximates  (3.4)  by 

r 2 K 

/ f(x)e~x2dx  » ^wkf(xk).  (3.5) 

^ k= 1 

The  weights  are  chosen  so  that,  if  f(x)  is  a polynomial  of  maximal  degree  (2 K — 1), 
the  approximation  in  equation  (3.5)  is  exact. 

Adaptive  Gauss-Hermite  quadrature  methods  center  the  integrals  about  the 
mode  of  the  integrand  and  decrease  the  number  of  nodes  needed  (Liu  and  Pierce, 

1994) .  Adaptive  Gauss-Hermite  quadrature  forms  the  basis  for  SAS’s  procedure 
NLMIXED  (Version  7 and  8). 

In  principle,  the  error  in  (3.5)  can  be  made  arbitrarily  small  by  increasing  the 
number,  K,  of  quadrature  points.  However,  establishing  exactly  how  the  error  (in 
the  approximation  of  the  integral)  propagates  through  the  maximization  can  be 
difficult  (Crouch  and  Spiegelman,  1990).  Moreover,  the  accuracy  of  quadrature 
methods  can  be  very  poor  for  higher  dimensional  integrals  (Evans  and  Swartz, 

1995) . 


3.3  The  Newton-Raphson  (NR)  Algorithm 
The  Newton-Raphson  (NR)  algorithm  is  an  iterative  method  that  is  often  used 
to  approximate  the  maximum  (or  minimum)  of  a function.  Newton-Raphson  is 
a popular  tool  in  many  fields  and  it  is  often  used  in  statistics  to  approximate  the 
MLE. 

The  heart  of  any  iterative  method  is  the  rule  it  uses  to  find  the  next  estimate 
given  the  current  one.  Essentially  two  decisions  need  to  be  made:  In  which 
direction  will  the  next  estimate  be  in  relation  to  the  current  one;  and  how  far  will 
the  next  estimate  be  from  the  current  one?  These  are  termed  the  step  direction 
and  step  size  of  the  method.  The  direction  of  the  fastest  increase  of  the  log 
likelihood  is  the  vector  of  the  first  derivatives  or  gradient.  Hence  a general  class  of 
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gradient  methods  can  be  defined  as  follows:  Given  a current  approximation,  to 
the  MLE,  determine  the  update  by 

V>(m)  = t/>(t)  + s(t)M(t)S(V>W)  (3.6) 

In  this  equation  a scalar,  is  the  step  size  for  the  tth  step,  is  a modifier 
matrix  for  the  tth  step  and  S(ip)  is  the  score  function  defined  in  (2.7).  The  method 
of  steepest  ascent  is  a special  case  of  (3.6)  with  = I.  Unfortunately  this 
particular  choice  performs  very  poorly  in  practice  (Bard,  1974),  tending  to  converge 
very  slowly. 

The  Newton-Raphson  method  on  the  other  hand  replaces  in  (3.6)  by 

-H(^w)  x,  where 

HW  = “ Wsw  <3'7) 

is  the  Hessian  matrix  of  the  log-likelihood  1 . Thus,  the  (t  + l)st  NR  update  has 
the  form 

t//t+1)  = - H(V>(t))_1S(^(t)).  (3.8) 

Newton-Raphson  enjoys  local  quadratic  convergence  to  a solution  of  the 
scoring  equations  (e.g.  Bronstein  and  Semendjajew,  1991).  Convergence  at  a 
quadratic  rate  means  that  the  number  of  correct  decimals  approximately  doubles  in 
each  iteration.  More  precisely,  if  and  are  two  successive  approximations 

to  the  MLE,  i/>,  then 

| t/>(t+1)  - v>|  « - V>|2,  (3.9) 


1 Within  the  context  of  Newton-Raphson,  it  is  common  to  refer  to  H(t/>)  as  the 

Hessian  matrix.  Notice,  however,  that  — H(^)  equals  the  observed  information 
matrix,  which  is  often  denoted  by  J0  (see  Section  7.1). 
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where  b > 0.  We  emphasize,  however,  that  NR’s  quadratic  convergence  rate 
typically  only  holds  locally,  in  a neighborhood  of  the  MLE.  Therefore,  equation 
(3.9)  may  only  be  true  close  to  convergence  of  the  algorithm. 

The  popularity  of  NR  as  an  optimization  routine  is  based  on  the  fact  that 
most  other  methods  do  not  achieve  a quadratic  convergence  rate.  However,  NR’s 
fast  rate  of  convergence  is  often  offset  by  several  disadvantages.  One  disadvantage 
is  that  NR  requires  the  computation  of  the  gradient  and  the  Hessian  of  the  log- 
likelihood  at  each  iteration  which  can  be  a very  tedious  computational  task. 
Moreover,  at  each  iteration  the  inverse  of  the  Hessian  has  to  be  calculated.  This  is 
generally  done  numerically  which  can  lead  to  severe  inaccuracies  due  to  rounding 
error  (see  Searle  et  al.,  1992,  p.142).  Several  modifications  of  NR  avoid  these 
problems.  Quasi-Newton  techniques,  for  example,  replace  the  inverse  of  the 
Hessian  by  an  approximation  that  only  uses  the  gradient  of  the  log-likelihood.  The 
downside  of  these  modifications  is  that  they  do  not  retain  NR’s  quadratic  rate  of 
convergence. 

Another  disadvantage  of  NR  is  that  it  only  converges  for  “sufficiently  good” 
starting  values  in  the  neighborhood  of  the  MLE.  Without  extra  computational 
effort  to  detect  good  starting  values,  NR  can  easily  diverge  to  values  on  the 
boundary  of  the  parameter  space  and  produce  meaningless  parameter  estimates, 
such  as  negative  variance  estimates  (Thompson  and  Meyer,  1986;  Callanan  and 
Harville,  1991).  But  choosing  “good”  starting  values  requires  some  prior  knowledge 
about  the  solution,  which  in  practice  is  often  not  available.  Alternatively,  “trial- 
and-error”  methods  or  “grid-search”  procedures  can  be  used  to  find  appropriate 
starting  points.  However,  these  methods  can  prove  to  be  very  time  consuming, 
especially  in  high  dimensional  problems. 

We  now  consider  two  examples  to  illustrate  and  further  investigate  the 
performance  of  the  Newton-Raphson  algorithm. 
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Figure  3.1:  Convergence  of  Newton-Raphson  in  the  beta-binomial  model. 


Example:  OWMM.  Consider  the  OWMM  introduced  in  Section  2.1.3.  We 
show  in  the  Appendix  (Section  A. 4,  equations  (A. 18)  and  (A. 19)),  that  the  score 
function  and  the  Hessian  are 

rnr  TTIV 

= d1  “ b )(/•*  - A),  H(n)  = y(l  - b),  (3.10) 

°0  a0 

where  b = ( rcrf)/(raj  + ctq)  and  p,  = y„.  It  follows  that  the  (t  + l)st  NR  update  is 

*x(t+1)  = n(t)  - = p.  (3.11) 


That  is,  NR  converges  in  one  iteration.  This  was  to  be  expected  as  for  this  model 
the  log-likelihood  is  a quadratic  function  in  //  and  NR,  by  construction,  finds  the 
extreme  value  of  a quadratic  function  in  only  one  iteration  (see,  e.g.,  Searle  et  al., 
1992). 
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Example:  Beta-Binomial  Model.  Consider  the  beta-binomial  model  introduced 
in  Section  2.1.1.  Notice  that  the  (marginal)  likelihood  is  available  in  closed  form. 

In  particular, 


13 


/ \ 


£(a,/3;  y)  = J j 


2=1 


Ui 

Vi 


V yi 


B(pr,0) 


(3.12) 


where  B(x;  y)  denotes  the  beta  function  with  parameters  x and  y.  For  the  data  in 
Table  2.1,  Booth  and  Caffo  (2001)  report  that  the  MLE  is  (a;0)  = (9.95;  240.8). 

Figure  3.1  shows  the  convergence  behavior  of  NR  for  maximizing  the  likelihood 
function  in  equation  (3.12)  for  several  different  starting  values.  Lines  1 through 
4 are  the  iteration  histories  for  the  starting  values  (c*(0);  0^)  e {(1, 2),  (1, 5), 

(1, 10),  (1,  20)},  respectively.  Notice  that  the  starting  value  for  the  a-component 
is  constant  over  all  four  runs.  Also,  compared  to  the  target  value,  0 = 240.8,  the 
changes  in  0 can  be  considered  rather  moderate. 

Notice  that  the  behavior  of  NR  is  similar  for  the  values  0^  = 5, 10  and  20, 
showing  some  improvement  as  0^  increases.  However,  NR  fails  for  0^  = 2;  for 
this  value,  the  method  breaks  down  and  diverges  to  the  boundary  of  the  parameter 
space.  This  indicates  that  even  small  changes  in  the  starting  value  can  have  a huge 
impact  on  the  performance  of  Newton-Raphson. 


3.4  The  EM  Algorithm 

Another  method  that  is  often  used  to  find  the  MLE  is  the  EM  algorithm.  The 
earliest  application  of  EM  dates  back  to  McKendrick  (1926).  However,  it  was  not 
until  the  work  of  Dempster  et  al.  (1977)  that  the  full  power  of  the  algorithm  began 
to  be  appreciated. 

The  EM  algorithm  is  an  iterative  procedure,  in  which  each  iteration  consists  of 
the  following  two  steps.  In  the  E-Step  one  computes  the  conditional  expectation  of 
the  complete  data  log-likelihood,  conditional  on  the  observed  data  and  the  current 
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parameter  estimate  if)®, 

Q(^ IV>(<))  = E (log/( y,u;^)|y;^(t))  • (3.13) 

In  the  M-Step  the  parameter  estimate  is  updated  by  the  maximizer  of  (3.13),  that 
is,  by  the  value  ip^t+1^  that  satisfies 

Q(V>(t+1)|V>(t))  > WlV>(t))  (3.14) 


for  all  values  of  ip  in  the  parameter  space.  Assuming  that  the  derivative  can  be 
passed  through  the  expectation  in  (3.13),  the  maximization  is  generally  accom- 
plished by  solving  the  EM  estimating  equations 


E(ip,ip{t])  = E ^7^1og/(y,u;  V>)  y;^(t)^=0.  (3.15) 

Equation  (3.15)  defines  the  parameter  update  as  an  implicit  function  of  ip^] 


ip{t+1)  _ M('0(t))  (3.16) 

say.  The  MLE  is  a fixed  point  of  the  function  M;  that  is  ip  = M(i/>).  Expanding 
M(^))  in  a first  order  Taylor  series  about  the  fixed  point  results  in  the  approxi- 
mate relation 

V>(t+1>  — « B [i \p[t)-ip],  (3.17) 

where  B(ip)  = dM(ip)  / dip'  and 

B = B(i/>).  (3.18) 

The  matrix  B is  often  called  the  rate  matrix  (e.g.  Meng  and  Rubin,  1991)  as  it 
governs  the  rate  of  convergence  of  the  algorithm.  Notice  that  the  smaller  the 
elements  of  B,  the  faster  EM  converges  to  the  MLE.  To  be  consistent  with  the 
common  notion  that  larger  values  indicate  faster  convergence,  some  authors  define 
the  speed  matrix  as  I — B (see  Meng,  1994). 
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EM  is  often  said  to  converge  at  a linear  rate.  This  is  presumably  because 
equation  (3.17)  shows  that  the  error  after  ( t + 1)  iterations  is  linearly  related  to  the 
error  after  t iterations.  Thus,  EM’s  convergence  is  slower  than  the  quadratic  rate  of 
Newton-Raphson . 

The  reasons  for  EM’s  popularity,  compared  with  other  numerical  methods,  are 
its  superior  stability  properties.  Dempster  et  al.  (1977)  and  Wu  (1983)  show  that 
the  algorithm  increases  the  value  of  the  observed  log-likelihood  at  each  iteration. 
Redner  and  Walker  (1984,  p.201)  note  that  “EM  [. . .]  has  been  found  in  most 
instances  to  have  the  advantage  of  a reliable  global  convergence,  low  cost  per 
iteration,  economy  of  storage  and  ease  of  programming,  as  well  as  certain  heuristic 
appeal.” 

However,  it  is  EM’s  rate  of  convergence  that  is  often  criticized.  Indeed,  Redner 
and  Walker  (1984,  p.201)  also  comment  that  “EM’s  convergence  can  be  madden- 
ingly slow  in  simple  problems  which  are  often  encountered  in  practice.”  Rubin 
(1991)  points  out  that  the  speed  of  convergence  of  EM  is  proportional  to  1 — b, 
where  b is  the  fraction  of  missing  information.  The  fraction  of  missing  information 
is  defined  as  the  ratio  of  missing-to-complete  information,  or  equivalently,  as  one 
minus  the  ratio  of  observed-to-complete  information  (see  Section  7.1  for  a more 
thorough  discussion  of  this  issue).  Consider  the  following  example  for  illustration. 

Example:  OWMM.  We  show  in  the  Appendix  (Section  A. 4,  equation  (A.  17)), 
that  the  ( t + l)st  EM  update  for  the  OWMM  is 

H{M)  = p + b(n®  -/}),  (3.19) 

where  b = cr\/(cr\  + er l/r).  Notice  that  0 < b < 1.  Equation  (3.19)  implies  that 
the  approximate  relation  in  (3.17)  holds  exactly.  Furthermore,  the  error,  fjSt+v>  — p, 
of  the  EM  update  after  (t  + 1)  iterations  equals  bt+1d0,  where  d0  = ^ — p,  the 
initial  error.  Therefore,  the  smaller  the  value  of  b,  the  faster  EM  will  converge  to 
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Figure  3.2:  EM’s  rate  of  convergence  in  the  OWMM. 


the  MLE.  But  notice  that  b measures  the  fraction  of  missing  information  (about  /i). 
Indeed,  notice  that  the  observed  data  information  is 


d2  log  L(n\y) 
d/j ? 


n=p, 


m 

ai  + ao/r 


(3.20) 


Similarly,  the  complete  data  information  is  (see  Section  7.1  for  more  details) 


-E 


’ d 2 log/(y,u;/z) 

i 

ca 

■fe 

to 

y i 

Thus,  one  minus  the  ratio  of  observed-to-complete  information  is 


(3.21) 


1 - 


°l/r 

a2  + a2/r 


= b; 


that  is,  the  fraction  of  missing  information. 

Figure  3.2  displays  EM’s  rate  of  convergence  for  increasing  fractions  of  missing 
information,  b 6 {0.3, 0.6, 0.9}.  For  each  of  these  three  values  of  b a set  of  data  was 
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Figure  3.3:  Convergence  of  the  EM  algorithm  in  the  beta-binomial  model. 

simulated  from  the  OWMM  (with  fixed  parameters  m = 10;  r = 10;  /r  = 0;  ctq  = 1), 
resulting  in  the  MLEs,  0,  0.3  and  0.6,  respectively.  These  are  displayed  by  the 
horizontal  lines  in  Figure  3.2.  Next,  the  EM  algorithm  was  applied  to  each  of  the 
three  data  sets  with  the  same  starting  value,  ^0)  = 1.  The  dashed  lines  show  the 
histories  of  the  first  30  EM-iterations. 

Notice  that  for  b = 0.3,  the  MLE  is  the  furthest  away  from  the  starting  value. 
However,  the  algorithm  converges  at  the  fastest  pace,  reaching  the  MLE  after  about 
10  iterations.  Conversely,  for  b = 0.9,  the  MLE  is  closest  to  the  starting  value. 

But  even  after  30  iterations,  EM  has  not  converged  yet.  This  illustrates  that  the 
smaller  the  fraction  of  missing  information,  the  faster  is  EM’s  rate  of  convergence. 

The  following  example  allows  us  to  compare  the  convergence  behavior  of  EM 
with  that  of  Newton-Raphson.  In  addition,  it  also  sheds  light  on  the  stability  of 
both  algorithms  with  respect  to  the  choice  of  the  starting  value. 
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Example:  Beta-Binomial  Model.  Consider  again  the  beta-binomial  model 
introduced  in  Section  2.1.1.  In  Figure  3.1  we  illustrated  the  convergence  of  NR  for 
this  model  using  four  different  starting  values.  Figure  3.3  shows  the  convergence  of 
EM  for  the  same  four  starting  values.  In  particular,  the  iteration  histories  of  EM 
are  given  by  the  four  solid  lines  in  Figure  3.3,  whereas  the  dashed  lines  reproduce 
the  corresponding  iteration  histories  of  NR  from  Figure  3.1. 

Our  first  observation  is  that  the  four  runs  of  EM  are  almost  indistinguishable 
from  each  other.  (Especially  for  the  ^-component,  all  four  iteration  histories  seem 
to  lie  on  the  same  line).  This  implies  that  EM’s  convergence  is  almost  identical 
for  each  of  the  four  starting  values.  In  contrast,  recall  that  NR’s  behavior  strongly 
depended  on  the  starting  point.  This  observation  certainly  suggests  that  EM  is  a 
more  stable  algorithm;  moderate  changes  in  the  starting  values  barely  affect  the 
convergence  of  EM.  In  particular,  recall  that  NR  did  not  converge  at  all  for  the  first 
starting  value  (a^;j3^)  = (1;2).  However,  this  point  does  not  cause  any  problems 
for  EM  at  all. 

A second  observation  in  Figure  3.3  is  that  EM  reaches  the  neighborhood  of 
the  MLE  faster  than  NR.  Clearly,  at  the  first  two  or  three  iterations,  EM  takes 
bigger  steps  than  NR.  However,  it  is  after  iteration  three  that  EM’s  criticized  slow 
convergence  sets  in.  It  can  be  seen  that  the  relative  gain  between  two  successive 
updates  after  iteration  three  is  much  smaller  for  EM  than  it  is  for  NR;  close  to  the 
MLE  NR  clearly  converges  at  a faster  rate. 

3.5  Analytical  Comparison  of  Newton-Raphson  and  EM 

Recall  that  the  EM  update  solves  the  equation  = 0 in  (3.15).  In 

most  practical  circumstances,  however,  there  will  be  no  explicit  solution  to  this 
equation  and  an  iterative  routine,  such  as  Newton-Raphson,  is  needed  to  find  the 
EM  update  That  is,  i// t+1 ^ is  the  limit  of  the  sequence  {'i//t,r)}£L0,  where 
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tp^  = and  for  r > 0, 


'lp(t’r+ 1)  — 


-l 


i\>= 


F(V?(t’r),^(t))- 


(3.22) 


Typically,  the  ( t + l)st  step  of  EM  consists  of  iterating  (3.22)  until  convergence. 
However,  a modification  of  the  EM  algorithm,  proposed  by  Lange  (1995),  is  to 
complete  only  one  NR  iteration  of  (3.22)  in  the  (t  + l)st  step.  Specifically,  if  we 
define 

H«M>,  = </-W)  = W \tm),  (3.23) 

the  Hessian  of  the  Q-function  in  (3.13)  2 . Let  H q(V,W)  = H Q(ip^\xp^).  Lange 
(1995)  defines  a modified  EM  algorithm  as 


tp(t+l)  = xp{t)  - Hq(^W)-1F(^W,  V»W).  (3.24) 


On  the  other  hand,  recall  that  Newton-Raphson  in  (3.8)  was  defined  as 


ip[t+l)  = ^(t)  _ H(^W)-1S(^(t)),  (3.25) 


where  S and  H are  the  gradient  and  the  Hessian  of  the  log-likelihood,  respectively. 
A close  relationship  between  algorithms  (3.24)  and  (3.25)  is  now  revealed  by  the 
identity 


S(V>)  =E 


' d_ 
_ dip 


log/(y,u;  ip) 


(3.26) 


(refer  to  the  Appendix,  Section  A.l,  for  details).  This  implies  that  S (xp)  = F(rp,xp) 
and  thus  the  difference  between  the  algorithms  (3.24)  and  (3.25)  is  only  due  to 
the  difference  between  the  two  Hessians,  Hq  = dF/dip'  and  H = dS/dxp' . 


2 Similar  to  the  negative  Hessian,  — H(V>),  which  is  often  referred  to  as  the  ob- 
served information,  — H q(xP)  is  also  called  the  complete  information  and  denoted 
by  Jc  (see  Section  7.1). 
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Furthermore,  we  show  in  the  Appendix  (Section  A. 2)  that 


H(^)  = H0(^)+Var 


l°g/(y,u;V>) 


(3.27) 


where 


Var 


log /(y,  u;  -0) 


(3.28) 


is  often  referred  to  as  the  missing  information.  (See  Section  7.1  for  a more  thor- 
ough discussion  of  this  issue.) 

However,  since  Var(<91og/(y,  u;  ifr)  / dij)\y;  ip)  is  semi-positive  definite,  it  follows 
from  equation  (3.27)  that  H(^>)  < H q(iP)  3 . This  means  that  the  NR  step  sizes 
are  larger  than  those  of  the  modified  EM  algorithm  in  (3.24),  which  certainly  helps 
to  explain  why  typically  EM  is  more  stable  than  NR. 

We  want  to  emphasize  that  this  last  paragraph  in  no  way  contradicts  our  pre- 
vious findings.  Recall  that  in  Figure  3.3  we  observed  that  at  the  initial  iterations 
(between  iteration  number  one  and  three,  say)  EM  takes  larger  steps  than  NR. 
However,  also  recall  that  in  this  example  we  used  the  original  EM  algorithm,  which 
executes  a complete  M-step  in  each  iteration  (in  contrast  to  Lange’s  modified  EM 
in  (3.24),  which  only  carries  out  one  NR-iteration  of  the  M-step).  To  compare  the 
performance  of  the  original  EM  with  Lange’s  modified  version,  refer  to  Figure  3.4. 

It  shows  the  first  nine  iterations  of  EM  (solid  lines),  NR  (dashed  lines,  labeled  “1”) 
and  Lange’s  EM  (dashed  lines,  labeled  “2”).  Clearly,  the  convergence  behavior  of 
Lange’s  modified  EM  resembles  more  NR  than  it  does  EM.  Notice  in  particular  the 
smaller  step  sizes  of  modified  EM  (compared  to  NR)  in  the  later  iterations. 


For  two  symmetric  matrices,  A and  B,  we  define  A < B,  if  x'(A-B)x  < 0,Vx. 
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Figure  3.4:  EM,  Lange’s  modified  EM  and  NR  in  the  beta-binomial  model. 

3.6  Modifications  of  EM 

In  this  section  we  discuss  several  important  modifications  of  the  EM  algorithm. 
We  start  with  a modification  that  facilitates  the  computation  of  the  M-step. 

Meng  and  Rubin  (1993)  introduce  the  ECM  (Expectation  / Conditional  Maxi- 
mization) algorithm  for  situations  where  the  EM  algorithm  is  unattractive  because 
the  M-step  is  computationally  difficult.  The  ECM  algorithm  takes  advantage  of  the 
simplicity  of  complete-data  conditional  maximum  likelihood  estimation  by  replacing 
a complicated  M-step  of  EM  with  several  computationally  simpler  CM-steps.  Meng 
and  Rubin  show  that  the  ECM  algorithm  shares  the  same  convergence  proper- 
ties of  EM,  such  as  increasing  the  likelihood  at  every  iteration.  Their  algorithm 
takes  the  following  form.  Let  the  parameter  of  interest,  ip,  be  partitioned  into  S 
(possibly  vector- valued)  components,  ip  — {ipx, . . . , ips).  For  s — 1, . . . , S,  let 

= (ip^+1\  ip2+l\  ... , ip^+1\ip^l  i, . . . , ip^)  be  the  s/<S-fraction  of  the  full 
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(t  + l)st  update;  that  is,  the  parameter  estimate  obtained  after  the  first  s condi- 
tional M-steps.  Then  in  the  (s  + l)st  CM-step  of  the  ( t + l)st  iteration,  ECM  finds 
the  value  such  that 

g(^(t+^i) |v>(t+^)  > Q(ip\^t+^),  v?/>  e 

where  denotes  the  parameter  space  and  ^s+1  the  subspace  that  satisfies  'Ls+1  = 

e ^ V4*+1),  • • • , +1) , ^+1,  2,  • • ■ , v4?)}- 

Liu  and  Rubin  (1994)  build  on  the  above  idea  and  develop  the  ECME  (Ex- 
pectation/ Conditional  Maximization  Either)  algorithm,  which  is  similar  to  the 
ECM  algorithm,  but  where  some  of  the  CM-steps  of  ECM,  which  maximize  the 
constrained  expected  complete-data  log  likelihood  function,  are  replaced  by  max- 
imizing the  constrained  actual  likelihood  function.  The  authors  show  that  this 
algorithm  shares  the  same  stable  monotone  convergence  as  EM  and  ECM,  but  can 
be  substantially  faster  than  either  EM  or  ECM,  measured  using  actual  computing 
time.  The  drawback  is  that  in  most  cases  where  the  EM  algorithm  is  useful,  the 
actual  likelihood  function  is  not  available  in  closed  form  and  hence  this  method  will 
not  be  applicable  in  many  situations  of  interest. 

The  two  previous  methods  are  in  principle  not  designed  to  speed  up  the 
convergence  of  the  EM  algorithm.  Their  motivation  is  to  simplify  the  M-step 
and  save  computing  time.  However,  they  generally  do  not  decrease  the  number 
of  iterations  needed.  Several  methods  have  been  proposed  under  the  title  “EM 
accelerators” . One  of  these  methods  is  Aitken  acceleration  investigated  by  Laird 
et  al.  (1987)  and  Louis  (1982). 

Laird  et  al.  (1987)  discuss  a univariate  and  a multivariate  Aitken  accelerator. 
We  consider  the  univariate  case  first.  There,  if  b denotes  the  univariate  analogue  of 
the  rate  matrix  B then  it  follows  from  equation  (3.17)  that,  if  b is  known,  after  the 
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( t + l)st  iteration  the  MLE  can  be  predicted  via 

« tp{t)  + ~ri^{t+1)  ~ ^(<))-  (3.29) 

The  Aitken  accelerated  EM  algorithm  uses  the  predicted  MLE  in  place  of  the  true 
EM  update;  that  is,  the  ( t + l)st  EM  update  is  replaced  by  the  right  hand  side  of 
equation  (3.29). 

In  the  multivariate  case,  1/(1  - b ) is  replaced  by  (I  - B)_1;  specifically,  in 
multivariate  Aitken  acceleration  the  MLE  is  predicted  according  to 

» t/>w  + (I  - B)_1(^(t+1)  - t/>(t)).  (3.30) 


In  practice,  b and  B in  equations  (3.29)  and  (3.30)  have  to  be  estimated.  Laird 
et  al.  propose  to  estimate  b by 


b = 


■0(*+i)  _ ^(t) 
ip{t)  _ i) 


(3.31) 


A multivariate  extension  of  this  idea  leads  to  an  estimate  for  B. 

Example:  Beta-Binomial  Model.  Figure  3.5  illustrates  the  performance  of  the 
Aitken  accelerated  EM  algorithm  in  the  beta-binomial  model.  For  the  starting 
value  (af(°);  ft0'*)  = (1;  2),  Figure  3.5  shows  the  first  5 iterations  of  EM  (dashed 
line)  and  accelerated  EM  (solid  line).  We  started  to  predict  the  MLE  after  the  first 
iteration;  that  is,  for  iterations  t = 2, . . . , 5,  we  replaced  the  true  EM  update  by  the 
predicted  MLE  in  equation  (3.30). 

Notice  that  the  accelerated  EM  algorithm  performs  very  well  in  the  beta- 
binomial  model.  The  first  prediction  step  (given  by  iteration  2 in  Figure  3.5)  brings 
the  accelerated  EM  algorithm  into  a close  neighborhood  of  the  MLE,  whereas  the 
EM  algorithm  is  still  far  from  convergence  at  this  point. 

Acceleration  methods  do  not  always  work  well,  however.  Laird  et  al.  suggest 
checking  that  the  prediction  actually  increases  the  likelihood  over  the  true  EM 
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Figure  3.5:  Aitken  accelerated  EM  in  the  beta-binomial  model. 

update.  Moreover,  equations  (3.29)  and  (3.30)  only  provide  good  approximations 
in  a neighborhood  of  the  MLE;  therefore  the  prediction  will  only  improve  over  the 
true  EM  update  when  the  algorithm  is  close  to  convergence. 

Many  other  modifications  exist  under  the  name  “EM  accelerators” . These 
include  Newton-Raphson  type  accelerators  by  Meilijson  (1989)  and  Jamshidian  and 
Jennrich  (1993),  and  an  acceleration  based  on  an  expansion  of  the  parameter  vector 
by  Liu  et  al.  (1998).  However,  these  modifications  are  beyond  the  scope  of  this 
dissertation. 


CHAPTER  4 

STOCHASTIC  MAXIMUM  LIKELIHOOD  COMPUTATION 

4.1  Introduction 

In  Chapter  3 we  introduced  Newton-Raphson  and  the  EM  algorithm,  two 
popular  methods  to  find  the  maximum  of  the  likelihood  function.  However,  in 
GHMs  neither  of  these  methods  are  typically  not  directly  applicable. 

Indeed,  we  have  argued  that  the  log  of  the  likelihood  function  in  (2.6)  is 
typically  analytically  intractable  for  GHMs.  Thus  its  derivative,  the  score  function 
S,  which  is  needed  in  every  iteration  of  NR,  is  generally  not  available  in  closed 
form.  Similarly,  for  the  EM  algorithm  we  need  to  compute  the  Q-function  in 
(3.13),  which  is  defined  as  the  conditional  expectation  with  respect  to  the  density 
tfMy;^),  where  2(u|y;-0)  oc  /(y|u;  t/>x)p(u;  ^>2).  But  since  the  normalizing 
constant  for  this  density  is  the  likelihood  function  in  (2.6),  Q (and  its  derivative  F) 
are  typically  not  available  in  closed  form. 

Therefore,  modifications  of  Newton-Raphson  and  the  EM  algorithm  are  needed 
in  order  to  make  these  algorithms  suitable  for  GHMs.  In  particular,  appropriate 
approximations  to  the  intractable  integrals  in  S and  F need  to  be  found.  In 
principle,  these  integrals  can  be  approximated  using  deterministic  methods  (for 
example,  using  a Laplace  approximation  or  numerical  integration).  However,  a 
disadvantage  of  deterministic  approaches  is  that  the  error  of  approximation  is 
typically  difficult  to  estimate.  Booth  and  Hobert  (1999)  use  this  point  to  motivate 
the  use  of  Monte  Carlo  integration. 

The  fundamental  idea  of  Monte  Carlo  integration  is  as  follows  (see,  e.g., 

Robert  and  Casella,  1999):  Suppose  we  want  to  approximate  an  integral  of  the 
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form 

1 = f f(u)g(u)du,  (4.1) 

where  g is  a probability  density.  If  we  can  draw  a random  (independent  and  identi- 
cal distributed)  sample  u^, . . . , from  the  density  g,  then  we  can  approximate 
(4.1)  by  the  empirical  average 

M 

(4.2) 

fc=l 

Notice  that  / is  an  unbiased  estimate  for  / and,  by  the  Strong  Law  of  Large 
Numbers,  I converges  almost  surely  to  / (as  M — > oo). 

In  contrast  to  deterministic  approaches  described  in  Chapter  3,  the  approxi- 
mation in  equation  (4.2)  is  of  stochastic  nature;  that  is,  the  error  of  approximation 
(the  Monte  Carlo  error),  I — I,  is  a random  variable  that  has  mean  zero  and 
variance 

VarU)  = j (/(«)  - I)2g{u)du.  (4.3) 

^ f f2(u)g(u)du  < oo,  then  Var(7)  can  be  estimated  from  the  sample  . . . , u ^ 

through 

1 M 

52  = A «E(/(“W)-h2.  (4.4) 

fc=l 

Then  by  the  Central  Limit  Theorem  and  Slutzky’s  Theorem, 

(/  — I)/s  A jV(0,  1),  as  M — »■  oo.  (4.5) 

Thus,  equation  (4.5)  can  be  used  to  construct  confidence  bounds  on  / — I. 

In  this  chapter  we  focus  on  stochastic  optimization  (or  Monte  Carlo  optimiza- 
tion). Monte  Carlo  integration  and  Monte  Carlo  optimization  are  often  closely 
related.  For  example,  combining  the  ideas  of  Monte  Carlo  integration  with  the 
concepts  of  a deterministic  optimization  technique,  such  as  the  EM  algorithm,  leads 
to  a stochastic  optimization  method. 


We  shall  describe  several  stochastic  optimization  methods  in  this  chapter.  In 
Section  4.2  we  describe  the  method  of  simulated  maximum  likelihood.  In  Section 
4.3  we  introduce  the  Monte  Carlo  EM  and  the  Monte  Carlo  Newton-Raphson 
algorithms,  stochastic  versions  of  EM  and  NR.  We  then  describe  the  fundamental 
ideas  behind  the  stochastic  approximation  method  introduced  by  Robbins  and 
Monro  (1951).  Combining  the  concepts  of  stochastic  approximation  with  those  of 
EM  and  NR  leads  to  two  further  algorithms,  the  stochastic  approximation  EM  and 
the  stochastic  approximation  Newton-Raphson  algorithms.  These  two  algorithms 
are  introduced  in  Section  4.4. 


4.2  Simulated  Maximum  Likelihood  (SML) 

Consider  the  likelihood  function  (2.6).  The  method  of  simulated  maximum 
likelihood  approximates  the  intractable  integral  in  (2.6)  using  Monte  Carlo  inte- 
gration. Specifically,  let  h( u)  denote  an  importance  sampling  distribution  whose 
support  includes  that  of  <?(•;  xp2).  For  example,  one  might  choose  h{ u)  = <7(11;  ip*2), 
where  xp 2 is  an  initial  guess  at  ip2.  An  equivalent  representation  of  the  likelihood  in 
(2.6)  is 

L{^\y)  = y*/(y|u;^i)^^^(u)du.  (4.6) 


Equation  (4.6)  suggests  a Monte  Carlo  approximation  to  the  likelihood  function  of 
the  form 


£(V>|y) 


1 

M 


M 


£/(y|U  “>;*,) 


g(  uW;Va) 
h(  u(fc)) 


(4.7) 


where  u(1),  ...,u(A/)  is  a random  sample  from  h.  This  gives  an  unbiased  estimate 
of  the  likelihood  regardless  of  the  choice  of  h.  The  SML  estimator,  tp,  maximizes 
(4.7).  That  is,  tp  satisfies 


L{rp |y)  > L(ip\y),Vip. 


One  problem  associated  with  SML  is  to  obtain  an  initial  guess  at  tp2.  A solution  is 
to  use  a two-  or  multi-stage  implementation  of  SML,  where  the  current  importance 
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sampling  distribution  is  chosen  based  on  the  maximizer,  x/j *2,  from  the  previous 
stage  (see,  e.g.,  Geyer  and  Thompson,  1992;  McCulloch,  1997). 

In  the  artificial  case,  when  xp2  is  known  and  one  can  simulate  directly  from  the 
random  effects  distribution  g , equation  (4.7)  simplifies  to 

1 M 

H^\y)  = /(ylu(fe);  Vh),  (4.8) 

1 k=  1 

where  u(1\  ...,  u(M)  is  a random  sample  from  g{ u;  ip2). 

Example:  OWMM.  Consider  the  OWMM  from  Section  2.1.3.  If  we  assume 
that  the  variance  components,  crfi  and  erf,  are  known  and  that  we  are  only  in- 
terested in  estimating  g,  then  we  can  use  approximation  (4.8)  with  importance 
sampling  distribution  iV(0,  cr?).  Figure  4.1  shows  the  plots  of  20  simulated  likeli- 
hood functions  based  on  (4.8)  for  different  Monte  Carlo  sample  sizes,  M e {5,50}, 
for  a simulated  set  of  data  from  the  OWMM  (with  parameters  m = r = 2, 

°o  = ai  = 1 and  MLE  g = 0).  Due  to  the  introduction  of  Monte  Carlo  error,  the 
simulated  likelihood  functions  vary  randomly  in  shape  as  well  as  in  location  about 
the  MLE.  Notice,  however,  how  the  variability  decreases  with  increasing  M. 

Importance  Sampling  for  SML.  The  main  practical  problem  when  applying 
SML  is  choosing  an  appropriate  importance  sampling  distribution  h.  In  principle, 
almost  every  choice  of  h will  produce  an  unbiased  and  consistent  estimator  of  the 
likelihood  function  (as  long  as  the  support  of  h includes  that  of  g).  However,  some 
choices  are  better  than  others.  Indeed,  some  importance  sampling  distributions  can 
result  in  infinite  variances  for  L (see,  e.g.,  Robert  and  Casella,  1999).  In  particular, 
importance  sampling  distributions  h with  lighter  tails  than  g (that  is,  those  with 
unbounded  ratios  g/h)  are  not  recommended. 

Many  methods  for  constructing  good  importance  sampling  distributions 
have  been  suggested  in  the  literature.  These  include  using  a mixture  of  several 
importance  distributions  and  stratified  sampling;  using  control  and  antithetic 
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Figure  4.1:  20  simulated  likelihood  functions  for  the  OWMM  for  different  M 
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variables;  and  also  the  use  of  non-parametric  methods  (see,  e.g.,  Oh  and  Berger, 
1993;  Zhang,  1996;  Owen  and  Zhou,  2000).  However,  the  choice  of  the  appropriate 
method  often  depends  on  the  model  and  the  data,  making  the  search  for  good 
importance  sampling  distributions  more  of  an  art  than  a science. 

4.3  Monte  Carlo  EM  and  Monte  Carlo  Newton-Raphson 
4.3.1  Monte  Carlo  EM  (MCEM) 

Recall  that  the  EM  algorithm  described  in  Section  (3.4)  requires  the  evaluation 
of  the  Q-function  in  equation  (3.13)  or  its  derivative,  F,  in  (3.15).  However, 
we  have  argued  at  the  beginning  of  this  chapter  that  Q and  F are  typically 
analytically  intractable  in  GHMs.  In  this  case,  a Monte  Carlo  version  of  the  EM 
algorithm  often  offers  an  appropriate  alternative. 

Wei  and  Tanner  (1990)  propose  to  approximate  an  intractable  Q-function 
by  Monte  Carlo  integration.  In  particular,  let  ip^  be  the  current  MCEM  update 
and  let  ,.t)  uTMd  be  a random  sample  of  size  Mt  from  p(u|y;  ip^),  the 
conditional  distribution  of  u given  y evaluated  at  ip  — ip^\  Then,  a Monte  Carlo 
approximation  to  Q is 

, Mt 

Q(ip\iP^)  = _-^log/(y,u(t’fc);^).  (4.9) 

* k=i 

The  Monte  Carlo  EM  algorithm  uses  Q in  place  of  Q\  that  is,  the  ( t + l)st  MCEM 
update,  ip't+1\  satisfies 

Q(-iP{t+1)\xP{t))>Q(iP\*pU),  Vip. 

Notice  that  in  equation  (4.9)  we  have  explicitly  allowed  the  Monte  Carlo  sample 
size,  Mt,  to  depend  on  the  iteration  number  t.  The  reason  for  this  is  that  MCEM 
typically  requires  Mt  to  be  increased  in  successive  iterations  in  order  for  the 
sequence  of  estimates  to  converge.  However,  we  will  sometimes  find  it  necessary 
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to  consider  MCEM  with  a constant  Monte  Carlo  sample  size  per  iteration.  In  this 
case,  we  will  simply  write  M,  omitting  the  subscript,  with  the  understanding  that 
the  sample  size  remains  fixed  for  all  t. 

We  have  pointed  out  that  the  EM  update  solves  the  estimating  equations 
0 = F with  F defined  in  (3.15).  A Monte  Carlo  approximation  to  F is 
straightforward.  Let 

-j  Mt  r\ 

f(^,  ip(t) ) = — los  /( y> u(t,fe);  '*/’)•  (4.io) 

Then  the  (t  + l)st  MCEM  update  solves  F(^>,^>0))  _ q 

Notice  that  Q in  (4.9)  and  F in  (4.10)  are  random  variates  which  have  ex- 
pectation Q and  F,  respectively.  (The  expectation  taken  with  respect  to  the 
conditional  density  g( u|y;^)(t^)).  Therefore  one  can  typically  decompose  MCEM 
into  two  components:  a deterministic  component  that  follows  the  EM  algorithm 
exactly,  and  a stochastic  component  due  to  Monte  Carlo  error.  In  other  words, 
MCEM  is  a stochastic  algorithm  that  follows,  on  average,  the  path  of  the  underly- 
ing deterministic  EM,  but  shows  random  variation  about  this  path.  Consider  the 
following  example  for  illustration. 

Example:  QWMM.  Consider  the  OWMM  from  Section  2.1.3.  We  show  in  the 
Appendix  (Section  A. 4,  equation  (A. 22)),  that  the  (t  + l)st  MCEM  update  is 

/x(t+1)  = + b(n(t)  - /t)  + et,  (4.11) 

where 

et  ~ Af(0,  cr2MC/M ) and  a2MC  = ball ( mr ) (4.12) 

and  b = a\j{a\  + a2/r).  Notice  that  equation  (4.11)  equals  the  update  of  the 
deterministic  EM  algorithm  in  (3.19)  plus  Monte  Carlo  error,  et.  Moreover,  since 
et  has  mean  zero,  MCEM  is  on  average  identical  to  EM.  However,  MCEM  varies 
randomly  about  the  EM  path  due  to  the  error  term  et.  Notice  that  the  Monte 
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Figure  4.2:  EM  and  MCEM  in  the  OWMM. 

Carlo  error  variance  is  proportional  to  1/M.  Thus,  for  a constant  Monte  Carlo 
sample  size  M,  the  error  variance  of  MCEM  does  not  decrease  and  therefore  the 
algorithm  does  not  converge.  This  fact  is  illustrated  graphically  in  Figure  4.2. 

Figure  4.2  shows  the  first  100  iterations  of  the  EM  algorithm  (thick  line)  for 
a simulated  set  of  data  from  the  OWMM.  Notice  that  EM  converges  after  about 
80  iterations  to  the  MLE.  Figure  4.2  also  shows  6 independent  runs  of  MCEM 
(thin  lines)  with  a constant  Monte  Carlo  sample  size  per  iteration.  We  observe 
that  MCEM  follows  the  path  of  EM  on  average,  but  individual  runs  vary  randomly 
about  this  average  path.  In  particular,  notice  that  MCEM  continues  to  show 
random  variation,  even  after  EM  converges.  Several  authors  (Wei  and  Tanner, 

1990;  McCulloch,  1997;  Booth  and  Hobert,  1999)  propose  to  increases  M as  the 
algorithm  progresses.  However,  most  of  the  suggestions  of  how  to  increase  M are 
somewhat  ad  hoc  and  driven  by  the  individual  application  of  the  algorithm.  Booth 
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and  Hobert  (1999)  are  the  first  to  develop  an  automated  sample  size  updating 
scheme  for  the  MCEM  algorithm. 

Automated  MCEM.  Booth  and  Hobert  (1999)  develop  an  MCEM  algorithm 
that  automatically  increases  M based  on  random  samples  from  <7(11  |y;t/>(t))-  They 
show  that  using  random  sampling  allows  the  Monte  Carlo  error  to  be  assessed 
at  each  iteration  using  standard  central  limit  theory  combined  with  Taylor  series 
methods.  These  can  be  used  to  construct  a sandwich  variance  estimate  for  the 
maximizer  at  each  approximate  E-step.  Specifically,  let  denote  the  MCEM 

update  which  satisfies  F(ip,ip^)  = 0 and  define  F{1)(x,  y)  = <9F(x,  y)/<9x.  If 
we  let  denote  the  value  that  satisfies  F(ip,tp^)  = 0 (that  is,  is 

the  true  EM  update  from  then  it  follows  that,  conditional  on  the  current 
iteration  value  ip^t+1^  is  approximately  normal  distributed  with  mean 
and  variance 

Var(V^t+1)  |V^)  ~ 

F(1)(t/?*(t+1),  ■0(t))-1Var[F(^*(t+1),  ■0(t))]F(1)(t/>*(t+1),  ■0(t))_1  (4.13) 

A sandwich  estimate  of  Var(y>(m)|^>(t))  is  obtained  by  substituting  in  place 

of  on  the  right-hand  side  of  expression  (4.13)  and  using  the  estimate 

Var  ^F(^*(t+1),  Vj(t)))  ~ 12  ^[dlog/(y’u(t’fe);^(^ 

Mt  fc=1  dip  dip 

This  suggests  a rule  for  automatically  increasing  the  Monte  Carlo  sample  size 
after  iterations  in  which  the  true  EM  step  is  swamped  by  Monte  Carlo  error.  In 
particular,  Booth  and  Hobert  suggest  that  after  the  (t  + l)st  iteration,  construct 
a 100(1  — a)%  confidence  ellipsoid  for  the  true  (but  unknown)  EM  step 
by  using  the  approximate  normality  of  its  Monte  Carlo  estimate  together 

with  the  estimate  of  its  approximate  variance  given  in  (4.13).  If  the  previous  Monte 
Carlo  EM  step,  lies  in  that  region,  then  the  EM  step  was  swamped  by  Monte 
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Carlo  error,  and  the  Monte  Carlo  sample  size  M should  be  increased.  Booth  and 
Hobert  suggest  a rule  of  the  form  Mt+l  <-  Mt(l  + a),  where  0 < a < 1.  In  their 
examples  they  used  a = 0.25  and  a € {|,  |},  but  the  optimal  choice  of  a and  a is 

a topic  that  needs  further  investigation. 

Stopping  Rules  for  Stochastic  Algorithms.  Booth  and  Hobert  also  address 
the  issue  of  a stopping  rule  for  stochastic  algorithms.  Deterministic  algorithms, 
like  NR  or  EM,  typically  stop  when  the  relative  change  in  the  parameter  values 
from  successive  iterations  is  small;  that  is,  they  use  a stopping  rule  (or  convergence 
criterion)  of  the  form 


where  Si  and  S2  are  predetermined  constants.  One  problem  with  this  stopping  rule 
in  the  context  of  stochastic  algorithms  (like  MCEM)  is  that  may  be  very 


not  because  the  algorithm  is  close  to  convergence.  To  reduce  the  risk  of  stopping 
prematurely  due  to  Monte  Carlo  error,  Booth  and  Hobert  suggest  to  stop  the 
algorithm  when  (4.14)  is  satisfied,  say,  C consecutive  times. 

Example:  OWMM.  Figure  4.3  shows  6 runs  of  automated  MCEM  for  a 
simulated  set  of  data  from  the  OWMM,  each  started  from  the  same  point,  ^ — 1, 
using  convergence  parameters  a = 0.25,  a = 1/5,  = 0.001,  d2  = 0.001  and  C = 3 

and  an  initial  Monte  Carlo  sample  size  of  M0  = 10.  The  first  row  in  Figure  4.3 
(plots  1 through  3)  shows  the  MCEM  histories  for  each  of  the  6 runs.  Plot  1 shows 
the  entire  history,  whereas  plots  2 and  3 only  show  windows  of  the  first  and  final  50 
iterations,  respectively.  The  second  row  in  Figure  4.3  (plots  4 through  6)  shows  the 
corresponding  sample  size. 

Observe  that  the  variability  of  the  MCEM  estimates  decreases  as  the  algorithm 
progresses  and  each  of  the  6 runs  converges  between  iteration  80  and  100.  However, 
Mt  increases  very  rapidly  at  the  later  stages  of  the  algorithm,  reaching  values 


(4.14) 


close  to  simply  because  of  a large  Monte  Carlo  error  associated  with  xpl't+1'>  and 
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Figure  4.3:  Automated  MCEM  and  Monte  Carlo  sample  sizes  in  the  OWMM. 

between  10,  000  and  40,  000  at  the  final  iterations.  This  indicates  that  extremely 
large  Monte  Carlo  sample  sizes  are  needed  for  convergence  of  MCEM.  On  the  other 
hand,  hardly  any  of  the  simulation  effort  is  spent  at  the  early  iterations.  Indeed, 
plots  5 and  6 show  that  the  Monte  Carlo  sample  sizes  barely  exceed  1, 000  in  the 
first  50  iterations. 

4.3.2  Monte  Carlo  Newton-Raphson  (MCNR) 

Recall  that  Newton-Raphson  requires  evaluating  the  score  function  S and 
the  Hessian  H.  However,  we  have  argued  at  the  beginning  of  this  chapter  that 
S (and  therefore  also  H)  are  typically  intractable  in  GHMs.  One  solution  is  to 
approximate  S and  H using  Monte  Carlo  integration.  This  leads  to  the  Monte 
Carlo  Newton-Raphson  algorithm. 
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A suitable  Monte  Carlo  approximation  to  S is  suggested  by  equation  (3.26) 
which  provides  a representation  for  the  score  function  of  the  form  S(ip)  — 
E[dlogf(y,u;ip)/  dip\y;ip}.  Therefore,  if  u^, . . . , denotes  a random 
sample  from  <?(u|y;  ip),  then  a Monte  Carlo  approximation  to  S is  given  by 

1 ^ f) 

sw  = (4-15) 

A Monte  Carlo  approximation  to  the  Hessian  matrix  H can  be  found  in  similar 
manner.  We  have  already  shown  (see  equation  (3.27))  that 


H {ip)  = H Q(ip)  + Var  ( ~ log  /(y,  u;  ip) 


y;^  , 


(4.16) 


where  H q{iP)  is  the  Hessian  of  the  Q-function  in  equation  (3.23).  Thus,  a Monte 
Carlo  approximation  to  the  first  term  on  the  right  hand  side  of  (4.16)  is 

M‘  Q2 


= w.  £ lo^(y- uW’  *)■ 


(4.17) 


Mt  ^ dip  dip 

Similarly,  an  approximation  to  the  second  term  on  the  right  hand  side  of  (4.16)  is 

,(*). 


Mt 


(4.18) 


fc=l 


dip 


Therefore,  we  can  approximate  H by 


%)  = Hq(^)  + V(^).  (4.19) 

Equation  (4.19)  offers  a very  appealing  approximation  to  H based  on  the  repre- 
sentation in  (4.16);  however,  other  approximations  are  possible.  Some  of  these  are 
discussed  later  in  this  section. 

The  Monte  Carlo  Newton-Raphson  algorithm  uses  S and  H in  place  of  S and 
H.  Thus,  MCNR  updates  the  parameter  estimate  according  to 


ip{t+1)  = ip®  - H (ip{t))~1S(ip(t)). 


(4.20) 
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Figure  4.4:  MCEM  and  MCNR  in  beta-binomial  model. 


Like  MCEM,  this  algorithm  also  does  not  converge  unless  the  Monte  Carlo  sample 
size  is  increased  as  the  algorithm  progresses. 

Example:  Beta-Binomial  Model.  Recall  the  performance  of  the  deterministic 
algorithms,  EM  and  NR,  for  the  beta-binomial  model:  EM  converged  for  both  of 
the  starting  values,  (1;  2)  and  (1;  20),  whereas  NR  only  converged  for  the  latter 
value  (see  Figures  3.1  and  3.3).  Figure  4.4  shows  the  behavior  of  the  corresponding 
Monte  Carlo  algorithms  with  constant  Monte  Carlo  sample  sizes,  M = 10. 

The  dashed  and  solid  lines  correspond  to  the  starting  values  (1;  2)  and  (1;  20), 
respectively.  MCEM  performs  as  expected:  it  reaches  the  neighborhood  of  the 
MLE  for  both  of  the  starting  values  and  oscillates  randomly  about  the  MLE  from 
there  on.  MCNR  diverges  for  the  value  (1;  2),  which  is  not  surprising  as  this  value 
also  caused  convergence  problems  for  the  deterministic  NR.  However,  the  starting 
value  (1;  20)  creates  unforeseen  problems  for  MCNR.  For  this  value,  MCNR 
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performs  as  expected  initially.  It  reaches  the  neighborhood  of  the  MLE  quickly  and 
then  fluctuates  randomly  about  this  point  for  the  next  few  iterations.  However,  the 
algorithm  then  breaks  down  and  diverges  to  the  boundary  of  the  parameter  space. 

Modifications  of  MCNR.  The  instability  of  the  Monte  Carlo  Hessian,  H,  in 
(4.19)  is  the  main  reason  for  the  erratic  behavior  of  MCNR.  If  M is  not  very  large, 
H can  vary  dramatically  from  one  iteration  to  the  next.  Since  H governs  the  step 
size,  such  erratic  behavior  can  result  in  huge  jumps  in  the  parameter  estimates. 

In  some  cases  the  estimates  may  jump  outside  the  domain  of  attraction,  causing 
MCNR  to  diverge.  We  have  found  that  the  following  modifications  of  MCNR  often 
work  well  in  practice. 

1.  Use  Hq 

We  have  argued  in  Section  3.5  that  H(^>)  < Hq(i/>)  (in  the  semi-positive 
definite  order),  causing  Lange’s  modified  EM  algorithm  to  take  smaller  steps 
than  NR.  This  implies  that  replacing  H by  Hq  in  the  corresponding  Monte 
Carlo  version  of  the  algorithm  will  lead  to  smaller  steps  sizes  for  MCNR  and, 
thus,  to  a more  stable  algorithm.  Notice,  however,  that  this  approach  results 
in  a modified  MCEM  algorithm,  rather  than  a NR-type  method. 

2.  Use  Hi 

A second  modification  is  to  use  an  average  of  Monte  Carlo  Hessian  matrices; 
in  particular,  if  we  let  H*  = H(^(l))  be  the  value  of  H evaluated  at  the  Ah 
parameter  update,  then  in  the  (t  + l)st  iteration  of  MCNR  we  replace  H(^(t)) 
by  the  average 


This  approach  smoothes  the  changes  in  the  Hessian  matrix  between  successive 
iterations.  However,  recall  that,  compared  to  EM,  NR  takes  rather  small 
steps  at  the  early  iterations  (see  Figure  3.3,  for  example).  Thus,  using  an 


(4.21) 
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Modification  1 - M=10 
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Modification  2 - M=10 
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Modification  3 - M=100 


Figure  4.5:  Modified  MCNR  in  beta-binomial  model. 


average  of  Hessians  may  lead  to  smaller  step  sizes  at  later  iterations  and  thus 
to  a slower  convergence  rate  for  MCNR. 

3.  Increase  M 

The  variability  of  H can  be  decreased  simply  by  using  a larger  Monte  Carlo 
sample  size.  However,  using  a larger  M may  decrease  the  overall  efficiency  of 
MCNR  measured  by  total  amount  of  simulation  or  by  total  computing  time 
needed  for  the  algorithm  to  converge. 

Example:  Beta-Binomial  Model.  Figure  4.5  illustrates  the  performance  of 
these  three  modifications  for  the  beta-binomial  model  with  the  starting  value 
(1;  20).  Notice  that  each  of  the  three  modifications  succeeds  in  stabilizing  MCNR. 
However,  using  an  average  of  Hessians,  Ht,  slows  down  the  algorithm  significantly. 
With  modifications  1 and  3,  MCNR  reaches  the  neighborhood  of  the  MLE  after 
about  six  to  eight  iterations,  whereas  using  H4,  it  takes  significantly  longer. 
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4.4  Stochastic  Approximation  (SA) 


In  Section  4.3  we  introduced  MCEM  and  MCNR,  two  iterative  methods  useful 
for  approximating  the  MLE.  Both  of  these  methods  require  that  the  Monte  Carlo 
sample  size  M be  increased  successively  for  convergence.  In  this  section  we  consider 
methods  that  converge  for  constant  M.  These  methods  combine  the  concepts  of 
EM  and  NR  together  with  those  of  stochastic  approximation. 

The  origin  of  stochastic  approximation  (SA)  dates  back  to  the  work  of  Robbins 
and  Monro  (1951).  It  is  for  this  reason  that  SA  (and  its  variants)  can  often  be 
found  in  the  literature  under  the  name  “Robbins-Monro”  procedure. 

Suppose  S(ip)  is  a function  with  scalar  argument  ip.  Suppose  further  that  we 
want  to  find  the  root,  ip,  of  S satisfying  S(ip)  = 0.  If  S is  known,  deterministic 
methods  like  Newton-Raphson  can  be  used  to  approximate  ip.  Robbins  and 
Monro  consider  a stochastic  generalization  of  the  above  problem  in  which  the 
precise  form  of  the  function  S is  unknown.  Instead  of  observing  S(ip)  exactly, 
suppose  we  only  observe  a random  variable  V (ip)  with  distribution  function 
Pr(V(ip)  < v)  = G(v\ip),  such  that 


Notice  that  in  GHMs,  S(ip)  can  be  considered  the  score  function  and  equation 
(3.26)  reveals  that  (4.22)  holds  with  V(ip)  = 31og/(y,  u;  ip) /dip  and  dG(v\ip)  = 
#(u|y ',ip)du.  Robbins  and  Monro  (1951)  show  that  for  a sequence  of  positive 
weights,  {7t}t>0,  such  that 


(4.22) 


^ 7t  = oo  and 


(4.23) 


the  Markov  chain,  {ip^}t> o,  defined  by 


•0(t+1)  = ip{t)  ~ 7 tvt 


(4.24) 
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converges  in  probability  to  the  solution  xp,  where  vt  satisfies  Pr( vt  < v\ip^>)  — 
G(v\ip^). 

Many  early  refinements  of  the  work  of  Robbins  and  Monro  exist,  most  of  which 
modify  the  convergence  details  of  ip®  — > xp  (see,  e.g.,  Blum,  1954;  Kallianpur, 

1954;  Wolfowitz,  1956).  Kiefer  and  Wolfowitz  (1952)  develop  a modification  of 
SA,  suitable  to  find  the  maximum  of  the  function  S (ip).  Kesten  (1958)  derives  an 
accelerated  version  of  SA,  which  takes  into  account  the  frequency  of  fluctuations 
in  the  sign  of  two  successive  iterations,  sign  (ip®  — xp^1^).  More  recently,  SA 
with  averaging  of  the  estimates,  xpt  = + 1),  has  been  proposed  as  a 

further  way  of  accelerating  the  algorithm  (see,  e.g.,  Gyorfi  and  Walk,  1996;  Wang 
et  al.,  1997;  Tang  et  al.,  1999).  Kushner  (1987)  studies  a combination  of  SA  and 
simulated  annealing  in  order  to  overcome  local  solutions.  Many  more  extensions 
and  modifications  of  SA  exist. 

Ruppert  (1985)  establishes  the  following  similarity  between  SA  and  Newton- 
Raphson.  Let  S (ip)  be  a function  from  TZ s to  V,s  and  ip  € 7 Is.  Ruppert  derives  a 
multivariate  version  of  SA  1 , 

ip(t+1)  = *P(t)  ~ Hr'St,  (4.25) 

where  A > 0 and  St  and  Hj  are  (stochastic)  estimates  for  S (ipt)  and  its  gradient, 
respectively.  Notice  that  omitting  the  down-weighting  factor  A/(t  + 1)  and 
replacing  St  and  Ht  in  (4.25)  by  the  score  function  S and  the  Hessian  H results 
in  the  Newton-Raphson  algorithm.  Equation  (4.25)  is  the  starting  point  for  our 
discussion  in  the  next  section. 


1 Ruppert  actually  considers  a version  of  SA  suitable  to  find  the  maximum  of 
the  function  S (ip)-  however,  it  is  straightforward  to  apply  his  ideas  to  the  case 

s(V0  = o. 
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4.4.1  Stochastic  Approximation  Newton-Raphson  (SANR) 

Let  u^, . . . , be  a random  sample  from  p(u|y;  The  SANR  algorithm 
updates  the  parameter  estimate  according  to 

V>(t+1)  = - 7 1 H(^(t))_1S(t/>w),  (4.26) 

where  {7<}^o  is  a sequence  of  decreasing  weights  satisfying  (4.23)  and  H and  S 
are  the  Monte  Carlo  Hessian  and  score  function  in  (4.19)  and  (4.15),  respectively. 
Notice  that  the  Monte  Carlo  sample  size  M is  typically  constant  over  all  iterations 
and  has  to  be  chosen  before  starting  the  algorithm. 

Gu  and  Li  (1998)  use  a special  case  of  the  algorithms  in  (4.25)  and  (4.26) 
with  the  specific  weight,  = 1/(1  + t).  Gu  and  Li  (1998)  also  propose  different 
choices  for  the  Monte  Carlo  Hessian  H.  They  suggest  using  Hq  in  (4.17)  at  the 
early  iterations  of  SANR  and  switching  to  in  (4.21)  when  “t  is  large”.  This 
recommendation  is  in  line  with  the  comments  made  in  Section  4.3.2  concerning  the 
stability  of  MCNR. 

Notice  that  the  difference  between  SANR  and  MCNR  is  only  due  to  the  down- 
weighting factor  7 1.  However,  this  down-weighting  factor  changes  the  properties  of 
the  algorithm  significantly.  Consider  the  following  example  for  illustration. 

Example:  QWMM.  Consider  the  OWMM  and  assume  that  the  Monte  Carlo 
sample  size  M is  constant  for  all  iterations.  Let  us  consider  MCNR  first.  We 
show  in  the  Appendix  (Section  A. 4,  equation  (A. 23))  that  the  Monte  Carlo  score 
function  is 

~ tut 

S(p)  = —[(l-b)(jji-li)  + et\1  (4.27) 

ao 

with  et  defined  in  equation  (4.12).  Recall  that  the  (exact)  Hessian  for  this  model  is 
H = —(1  — b)mr/al  (see  equation  (3.10)).  Using  H (instead  of  an  approximation 
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H),  the  difference  between  the  (t  + l)st  MCNR  update  and  the  MLE  is 


(4.28) 


where  e*  ~ N( 0,  (1  — b)~2a2MC/M)  and  a2MC  defined  in  (4.12).  Notice  that  the 
(t  + l)st  iteration  does  not  improve  over  the  tth  iteration  in  the  sense  of  a reduced 
variance;  in  fact,  for  constant  M,  the  variance  of  the  MCNR  estimate  does  not 
decrease  for  successive  iterations  and  therefore  MCNR  does  not  converge. 

Now  consider  SANR  in  (4.26)  with  the  weight  7 1 = 1/(1  + t).  We,  again,  use 
S in  (4.27)  and  the  exact  Hessian  H instead  of  an  approximation  H.  The  ( t + l)st 
SANR  update  is 


It  follows  from  equation  (4.29)  that  the  difference  between  the  ( t + l)st  SANR 
update  and  the  MLE  is 


say,  where  e**  ~ N( 0,  (1  — b)~2a2MC/[M(t  + 1)]).  Thus,  the  variance  of  e**  decreases 
as  the  number  of  iterations,  t,  increases.  This  implies  that  SANR  converges  for  a 
constant  Monte  Carlo  sample  size  M.  Figure  4.6  illustrates  this  point. 

Figure  4.6  shows  the  first  400  iterations  of  MCNR  and  SANR  in  the  OWMM, 
generated  according  to  equations  (4.28)  and  (4.29),  respectively.  The  MLE  is 
indicated  by  the  horizontal  line  in  each  plot.  The  first  plot  in  Figure  4.6  shows 
the  iteration  history  for  MCNR.  Clearly,  MCNR  does  not  converge  and  continues 
to  fluctuate  randomly  around  the  MLE  even  after  400  iterations.  The  second  plot 
shows  SANR.  Observe,  that  SANR  shows  some  variability  in  the  initial  iterations. 
However,  the  variability  decreases  rapidly  and  SANR  converges  to  the  MLE  after 
about  350  iterations. 


At(m)  = /x(t)  + — (A  - /i(t)  + e*). 
t | 1 


(4.29) 


(4.30) 
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Iteration 

Figure  4.6:  MCNR  and  SANR  in  the  OWMM. 

Convergence  Rate  of  Stochastic  Approximation.  A word  of  caution  is  necessary 
here.  In  more  general  models,  one  typically  has  to  pay  a price  for  convergence 
with  constant  M.  And  this  price  is  often  an  extremely  slow  rate  of  convergence. 
Consider  the  next  example  for  illustration. 

Example:  Beta-Binomial  Model.  Figure  4.7  illustrates  the  performance  of 
SANR  in  the  beta-binomial  model  for  different  weights  of  the  form 

lt  = * = 0,1,2,....  (4.31) 

For  four  different  values  of  A we  ran  SANR  for  200  iterations,  starting  each  run 
from  (a;(0);  = (1;20),  using  the  Monte  Carlo  Hessian  H in  (4.19)  and  a 

constant  Monte  Carlo  sample  size,  M = 100.  The  plots  show  only  the  iteration 
histories  for  the  ct-component  (thick  lines).  The  behavior  of  the  /5-component, 
however,  is  very  similar.  For  comparison,  we  also  included  the  iteration  history  for 
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Figure  4.7:  Rate  of  convergence  of  SANR  in  the  beta-binomial  model. 


one  run  of  MCNR  (thin,  dotted  lines)  from  the  same  starting  value  with  constant 
M = 100. 

Notice  that  for  the  value  A = 1 the  convergence  rate  of  SANR  is  extremely 
slow;  even  after  200  iterations  the  algorithm  is  still  far  from  the  MLE.  However, 
there  is  barely  any  random  noise  in  the  parameter  updates  in  this  case.  Conversely, 
as  A increases,  the  Monte  Carlo  error  grows  but,  at  the  same  time,  the  rate  of 
convergence  of  the  algorithm  improves.  In  particular,  as  A gets  larger,  SANR 
resembles  more  and  more  MCNR.  This  fact  reveals  the  conflict  that  the  user 
of  SANR  (and  also  other  stochastic  approximation  methods)  often  faces.  If  he 
chooses  weights,  7 1,  that  are  very  small  at  the  beginning  of  the  algorithm,  then  he 
sacrifices  a fast  convergence  rate  for  almost  no  Monte  Carlo  error.  Conversely,  if 
he  chooses  weights  that  are  large  at  the  initial  stages,  then  the  algorithm  reaches 
a neighborhood  of  the  MLE  quickly  at  the  price  of  a possibly  huge  Monte  Carlo 
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Figure  4.8:  Modifications  of  SANR  in  the  beta-binomial  model. 


error.  Clearly,  the  best  choice  of  7*  finds  a good  compromise  between  a fast 
convergence  rate  and  small  Monte  Carlo  error  variance.  However,  no  general 
guidelines  for  the  choice  of  jt  exist.  In  fact,  most  recommendations  are  somewhat 
ad  hoc  and  tailored  towards  a specific  model  and  set  of  data. 

Modifications  of  SANR.  We  pointed  out  in  Section  4.3.2  that  MCNR  can  have 
stability  problems  when  using  H in  (4.19)  together  with  a,  typically,  small  Monte 
Carlo  sample  size.  These  problems  carry  over  to  SANR.  Observe  that  in  Figure  4.7 
we  used  SANR  with  M — 100.  Often,  however,  it  is  desirable  to  use  SANR  with  a 
smaller  M in  order  to  increase  the  efficiency  of  the  algorithm.  Figure  4.8  shows  200 
iterations  of  SANR  with  M = 10  using  (1)  H (thick  dotted  lines);  (2)  HQ  in  (4.17) 
(thin  solid  lines);  and  (3)  Ht  in  (4.21)  (thin  dashed  lines). 

We  observe  that  SANR  breaks  down  when  using  H.  This  implies  that  an 
alternative  to  H should  be  used  when  the  objective  is  to  run  SANR  with  a small 
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M.  Figure  4.8  shows  that  both  of  the  alternatives,  Hq  as  well  as  H*,  stabilize  the 
algorithm.  Notice,  however,  that  using  fib  in  place  of  H slows  down  the  algorithm 
significantly,  especially  when  A = 1,  where  an  already  very  slow  rate  of  convergence 
for  SANR  is  drastically  reduced  by  using  Ht.  On  the  other  hand,  using  Hq  in 
place  of  H works  quite  well  and  the  results  are  very  similar  to  those  in  Figure  4.7 
(with  a larger  M).  However,  as  with  MCNR,  replacing  H by  Hq  results  in  an 
EM- type  algorithm  which  is  considered  in  detail  in  the  next  section. 

4.4.2  Stochastic  Approximation  EM  (SAEM) 

Delyon  et  al.  (1999)  propose  a stochastic  approximation  to  the  EM  algorithm. 
They  suggest  replacing  the  Q-function  in  the  MCEM  algorithm  with  a weighted 
average  of  Q-like  functions.  Specifically,  the  ( t + l)st  update  is  obtained  by 
maximizing 

Qt+ 1('0)  = (1  - 7 t)QtW  + 7t<2(V,lV>W),  (4.32) 

with  Q defined  in  equation  (4.9)  and  {7t}t>o,  a sequence  of  decreasing  weights  that 
satisfy  (4.23).  In  general,  70  = 1 and  Q0  = 0,  which  implies  that  the  first  SAEM 
update  is  identical  to  the  first  MCEM  update.  In  fact,  if  = 1,  Vi,  SAEM  reduces 
exactly  to  MCEM.  The  ( t + l)st  SAEM  update  is  typically  obtained  by  solving  the 
estimating  equation  Ft+i(tp)  = 0,  where 

Ft+i('0)  = (1  ~ 7 t)Ft(^)  + 7t F(V>,  '0(t)),  (4.33) 

and  F is  defined  in  (4.10).  Delyon  et  al.  establish  that  SAEM  converges  to  a local 
maximum  of  the  likelihood  function  for  a fixed  Monte  Carlo  sample  size,  M . 

Example:  OWMM.  Consider  the  OWMM  and  assume  that  the  Monte  Carlo 
sample  size  M is  fixed  for  all  iterations.  It  follows  from  equation  (4.11)  that  the 
difference  between  the  (t  + l)st  MCEM  update  and  the  MLE  is  given  by 


/*(*+!)  _ A = 6<+ V0)  - A)  + ef, 


(4.34) 
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where 


(4.35) 


with  e,  and  a2MC  defined  in  (4.12).  Notice  that  bt+1(n (°)  - /})  ->  0 (as  t -¥  oo)  since 


and  hence  MCEM  does  not  converge  for  fixed  Monte  Carlo  sample  size. 

Now  consider  SAEM.  One  can  show  (see  Section  7.1  for  more  details)  that  for 
weights  of  the  form  = 1/(1  + t),  the  difference  between  the  (t  + l)st  SAEM 
update  and  the  MLE  is 


Notice  that  for  large  t,  T(t  + 1 + b)/T(t  + 2)  ~ £6-1  (see  Abramowitz  and  Stegun, 
1992,  p.257).  Thus,  since  0 < b < 1,  ftt+1  —>  0 (as  t — > oo).  Moreover,  the  variance 
of  e*  vanishes  as  t increases.  Thus,  in  contrast  to  MCEM,  SAEM  converges  for  a 
constant  Monte  Carlo  sample  size. 

Example:  Beta-Binomial  Model.  Figure  4.9  illustrates  SAEM’s  convergence 
rate  for  the  beta-binomial  model  with  weights,  7*,  of  the  form  (4.31).  The  plots 
show  the  first  200  iterations  of  SAEM  (thick  lines)  run  with  a constant  Monte 
Carlo  sample  size,  M = 10,  per  iteration,  for  A € {1;  5;  10;  20}.  For  comparison, 
we  included  MCEM  (thin,  dotted  lines)  run  with  the  same  constant  M . Clearly, 
the  smaller  the  weights  are  at  the  early  stages  of  the  algorithm,  the  faster  SAEM 
reduces  Monte  Carlo  error.  However,  at  the  same  time,  the  algorithm’s  convergence 
rate  deteriorates  significantly  with  smaller  values  of  A.  As  with  SANR,  there  is 


0 < b < 1.  However,  the  variance  of  e*  in  (4.35)  does  not  decrease  for  constant  M 


- ji  - 6,+i(fi(0)  - it)  + e* 


(4.36) 


where 


,,+1  fi 1 + i r (i  + 2)r(6) 


-i— r b + i T(t  + 1 + 6) 


(4.37) 


(4.38) 
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Figure  4.9:  Rate  of  convergence  of  SAEM  in  the  beta-binomial  Model 


a trade-off  between  a fast  rate  of  convergence  and  an  efficient  Monte  Carlo  error 
reduction. 

Stopping  Rules  for  Stochastic  Approximation  Algorithms.  The  appeal  of 
stochastic  approximation  methods  (like  SANR  or  SAEM)  is  that  they  reduce 
Monte  Carlo  error  and  thus  converge  with  constant  Monte  Carlo  sample  size. 

We  have  seen  that  a fast  error  reduction  typically  results  in  a very  slow  rate  of 
convergence.  Thus,  differences  between  successive  parameter  updates  are  very  small 
even  when  the  algorithm  is  far  from  convergence.  Since  most  common  stopping 
rules  are  based  on  the  detection  of  small  differences  between  successive  iterations, 
the  application  of  these  rules  to  stochastic  approximation  algorithms  will  often  lead 
to  a premature  decision  to  stop. 

Figure  4.10  shows  what  happens  when  applying  the  stopping  rule  in  (4.14) 
(using  the  parameters  Si  - 0.001,  52  = 0.001  and  C = 3)  to  SAEM.  The  end 
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Figure  4.10:  Convergence  of  SAEM  in  the  beta-binomial  Model 


of  the  solid  lines  indicate  the  points  at  which  SAEM  stops  due  to  (4.14).  The 
dashed  lines  indicate  the  continuation  of  SAEM,  had  the  stopping  rule  not  been 
used.  We  observe  that  for  each  of  the  four  runs,  SAEM  stops  prematurely;  that 
is,  SAEM  clearly  stops  far  from  the  MLE  and  before  eliminating  the  Monte  Carlo 
error  entirely. 

Figure  4.10  clearly  shows  that  stopping  rules  of  the  form  (4.14)  are  inappropri- 
ate for  stochastic  approximation  procedures.  However,  it  is  not  clear  at  all  whether 
there  exist  alternative  rules,  suitable  for  slow-converging  algorithms  like  SAEM.  In 
fact,  it  is  very  common  to  base  the  decision  to  stop  on  an  inspection  of  a plot  of 
the  parameter  updates  versus  the  iteration  number.  However,  this  approach  cer- 
tainly does  not  automate  the  implementation  of  these  methods  and,  furthermore, 
since  stochastic  approximation  procedures  tend  to  have  very  flat  iteration  histories, 
it  will  also  often  lead  to  wrong  conclusions. 


CHAPTER  5 

EFFICIENCY  OF  MONTE  CARLO  EM  AND  SIMULATED  MAXIMUM 

LIKELIHOOD 

Unlike  analytical  or  numerical  approximation,  the  error  of  Monte  Carlo 
based  approximation  methods  can  typically  be  made  arbitrarily  small  by  simply 
increasing  the  Monte  Carlo  sample  size.  This  implies  that  the  researcher  can, 
at  least  in  principle,  control  the  accuracy  of  the  estimates  by  choosing  the  total 
simulation  amount  appropriately.  However,  in  practice  this  approach  is  often  not 
feasible.  Time  limitations  and  computing  power  often  restrict  the  total  number  of 
simulations,  making  it  desirable  to  use  the  available  simulations  most  efficiently. 

In  particular,  having  several  competing  estimation  methods  to  choose  from,  it  is 
desirable  to  select  that  method  that  estimates  the  parameter  most  efficiently. 

In  this  chapter  we  investigate  the  efficiency  of  MCEM  and  SML.  The  efficiency 
of  a Monte  Carlo  method  can  be  measured  by  the  magnitude  of  its  Monte  Carlo 
error  variance.  A method  with  a large  variance  will  need  a larger  total  simulation 
amount  to  reach  the  same  accuracy  as  a method  with  a smaller  variance.  Mc- 
Culloch (1997,  p.165)  conducted  an  empirical  study  to  investigate  the  variance  of 
MCEM  and  SML  in  the  case  of  the  logistic-normal  model.  He  found  that  SML 
“performs  poorly  [...],  showing  a very  large  variance”  compared  to  MCEM.  In  this 
chapter,  we  investigate  the  Monte  Carlo  error  of  MCEM  and  SML  analytically.  In 
particular,  using  first  and  second  order  approximations  we  derive  the  Monte  Carlo 
errors  of  MCEM  and  SML  for  the  class  of  GLMMs.  We  show  that  the  variance  of 
SML  is  unbounded  relative  to  that  of  MCEM  and  use  this  result  to  conclude  that 
MCEM  is  a more  efficient  method  in  many  applications  of  GLMMs. 


68 


69 


This  chapter  is  organized  as  follows.  In  Section  5.1  we  derive  the  asymptotic 
Monte  Carlo  standard  errors  of  the  MCEM  and  SML  estimates  and  discuss 
the  implications  of  the  results.  In  Section  5.2  we  illustrate  our  results  with  two 
examples.  We  conclude  this  chapter  by  pointing  out  practical  limitations  to  the  use 
of  SML. 


5.1  Efficiency  of  MCEM  and  SML  in  GLMMs 
5.1.1  Asymptotic  Monte  Carlo  Error 

Consider  the  GLMM  defined  in  Section  2.1.1.  We  will  assume  that  the 
variance  components  and  cr\  are  known  and  that  we  are  only  interested  in 
estimating  0 (i.e.  i ft  = 0).  This  assumption  favors  SML  since  the  importance 
sampling  distribution  is  exact  and  approximation  (4.8)  can  be  used.  Suppressing 
the  dependence  on  cr$  and  erf,  we  will  write  /(y|u;/3)  = /(y|u;  0,  a$)  and 
g{ u)  = g( u;  aj)  for  the  conditional  density  of  the  data  and  the  marginal  density  of 
the  random  effects,  respectively. 

In  the  following  we  derive  the  asymptotic  distribution  of  a one-step  MCEM 
iteration  at  the  MLE,  that  is,  given  that  the  current  parameter  estimate  equals  the 
MLE,  we  derive  the  asymptotic  distribution  of  the  MCEM  update.  Let  /3(f)  be  the 
tth  MCEM  update  and  the  corresponding  deterministic  EM  update.  Notice 
that  EM  does  not  move  from  the  MLE,  0.  Therefore,  if  = 0,  then  (3*®  = 0, 
and  the  difference,  0^  — 0,  is  pure  Monte  Carlo  error.  Without  loss  of  generality 
we  may  take  t — 1.  Notice  that  F(0,0)  in  (4.10)  is  an  average  of  the  i.i.d.  variates 
Xk((3)  = <91og/(y,  u(1’fc);/3)/<9/3  with  mean  £[Xi(/3)|y;  0\  = S(/3)  = 0 and  variance 
Var[Xi(/3)|y;/3]  = i,  where 


I = £ 


d__ 

30 


log/(y,u;/3) 


3_ 

00 


log/(y,ii;/3) 


y,0 


(5.1) 


/3=0 
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Thus  by  the  Central  Limit  Theorem,  y/M  F{0,0)  ->■  J\f( 0,  i).  Notice  furthermore 
that  ¥(0^,0)  = 0.  A first  order  Taylor  expansion  of  F(/3(1\^3)  about  0 together 
with  the  Law  of  Large  Numbers  and  Slutzky’s  Theorem  reveals  that 


lMCEM 


-fajL+  jV(0,  rlCEJ,  as  M ->  oo, 


(5.2) 


where  rj, CEM  = J xij  1 and 


3 = E 


8 2 


d/3d(3 


7 log /(y  > u;  y3) 


y;/3 


(5.3) 


Notice  that  r^CEM  quantifies  the  Monte  Carlo  error  of  MCEM  near  the  MLE. 

In  similar  fashion,  a first  order  Taylor  expansion  of  8L{0\y)  / 80  in  (4.8)  about 
(3  shows  that 


^sml  = V M (0  — /3)  7 A/" (0,  TgML),  as  M — > oo,  (5-4) 


with  TgML  = J 1IJ  1 and 


i 

j 


E 

E 


8_ 
00 
8 2 


8080 


/(y|u; 0)^j  (Jjpfi y|u;^)^ 
?/( ylu;/3) 


p=/3 


(5.5) 

(5.6) 


where  the  expectations  in  (5.5)  and  (5.6)  are  with  respect  to  the  marginal  random 
effects  density,  <?(u). 

Let  W be  the  diagonal  matrix  of  iterative  GLM  weights,  wu  — 1 /{a,-V(//j) 
g'(pi)2},  where  V(jt/j)  = b"(6i)  is  the  variance  function  for  (2.1),  and  let  W be 
the  value  of  W evaluated  at  the  mode,  u,  of  /( y,  u;  0).  We  show  in  the  Appendix 
(Section  A. 5)  that  |r^CEM|,  the  generalized  variance  of  AMCEM,  is  proportional  to 
|X|,  where 


S = [Z'WZ  + G"1]-1. 


(5.7) 
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Assume  that  the  random  effects  covariance  matrix  G is  positive  definite  and  let 
Amax  and  Amin  be  its  largest  and  smallest  eigenvalue,  respectively.  Observe  that 
|S|  — ► 0,  as  Amax  approaches  zero.  This  is  the  case  when  all  the  effects  variance 
components  are  essentially  zero  and  the  mixed  model  reduces  to  a fixed  effects 
model.  Conversely,  |£|  — > |Z'WZ|-1,  as  Amin  diverges  to  infinity;  that  is,  as  the 
variability  of  the  random  effects  increases  without  bound. 

To  illustrate,  consider  the  OWMM  with  m groups  and  r observations  per 
group.  Let  G = a\ Im,  W = (l/oo)In,  and  Znxm  = Im  <g>  lr,  where  n = m x r.  It 
follows  that 


where  c(af)  = al/{r  + (cg/of)).  With  er2(>  0)  fixed,  c(af)  approaches  0 as  o\  — >■  0. 
On  the  other  hand,  as  of  — » oo,  c(af)  converges  to  crl/r.  Therefore,  |E|  is  bounded 
between  0 and  ( o\/r)m . 

This  implies,  that  regardless  of  the  magnitude  of  the  random  effects  variance 
components  in  G,  the  generalized  variance  of  AMCEM  has  an  upper  bound,  which  is 
determined  by  the  model  and  the  data. 

In  contrast,  we  also  show  in  the  Appendix  (Section  A. 5),  that  |rgML|  = 

Op  (|G|1//2).  Hence  |r2ML|  is  unbounded  as  a function  of  G.  Indeed,  as  the  compo- 
nents of  G can  become  arbitrarily  large,  the  generalized  variance  of  ASML  does  not 
have  an  upper  bound.  In  Section  5.2  we  discuss  two  examples,  both  of  which  illus- 
trate the  superior  efficiency  of  MCEM  relative  to  SML  when  the  effects  variance  is 
large. 

5.1.2  Discussion 

The  generalized  variances  of  AMCEM  and  ASML  imply  that,  if  the  random  effects 
variance  components  in  G are  small,  both,  MCEM  and  SML,  will  have  small 
Monte  Carlo  error  variances.  But  as  the  effects  variance  components  approach  zero, 
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the  mixed  model  reduces  to  a fixed  effects  model.  Thus,  the  interesting  practical 
situations  are  when  one  or  more  of  the  effects  variance  components  are  large.  Our 
results  imply  that  the  Monte  Carlo  variance  of  SML  is  unbounded  relative  to  that 
of  MCEM  as  the  magnitude  of  the  effects  variance  components  increase.  Therefore, 
when  there  is  more  variability  in  the  data  than  can  be  accounted  for  by  a fixed 
effects  model,  MCEM  is  generally  a more  precise  and  efficient  method  than  SML. 

A heuristic  explanation  of  this  result  is  that  SML  involves  sampling  from  the 
( marginal ) distribution  of  the  random  effects  u which  has  variance  Var(u)  = G. 

If  the  sampling  variance  of  a simulation-based  estimation  method  is  large,  then 
parameter  estimation  is  done  in  a population  with  large  heterogeneity.  Precise 
estimation  in  a highly  heterogeneous  population  will  in  general  be  difficult.  Indeed, 
as  G is  unbounded  (component- wise),  the  sampling  variance  of  SML  can  be 
arbitrarily  large.  This  partially  explains  the  erratic  behavior  of  SML,  which  was 
also  observed  by  McCulloch  (1997). 

Conversely,  note  that  the  sampling  variance  of  MCEM  is  bounded.  Indeed,  in 
every  iteration  MCEM  draws  a random  sample  from  the  conditional  distribution, 
u|y.  Booth  and  Hobert  (1998,  Eq.12  & 14)  derive  Var(u|y)  « S,  which  is  exact 
in  the  LMM.  But  as  £ is  bounded  (with  respect  to  G),  MCEM  samples  from  a 
population  with  finite  and  bounded  variance,  making  a precise  estimation  more 
likely.  This  helps  to  explain  the  superior  performance  of  MCEM. 

Our  derivations  of  the  asymptotic  variances  r^CEM  and  TgML  (given  in  the 
Appendix)  involve  Laplace  approximations  whose  accuracy  depends  on  both, 
sample  size,  n,  and  the  effects  variance,  G.  However,  our  goal  is  to  gain  insight 
into  the  relative  efficiencies  of  MCEM  and  SML  in  terms  of  computational  effort. 
Both  methods  can,  in  principle,  achieve  any  level  of  accuracy  by  simply  increasing 
the  Monte  Carlo  sample  size  M.  In  contrast,  analytical  approximations  such  as 
PQL  (Breslow  and  Clayton,  1993)  have  a fixed  level  of  accuracy  determined  by 
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Figure  5.1:  Variability  of  MCEM  and  SML  for  increasing  variance  components. 

the  model  and  the  data.  Since  our  derivations  are  exact  for  the  LMM,  one  can 
expect  the  Laplace  approximations  to  improve  the  more  closely  the  GLMM  can  be 
approximated  by  a LMM.  Finally,  we  note  that  our  derivations  assume  the  variance 
components  are  known.  The  examples  in  Section  5 indicate  that  our  analytical 
results  qualitatively  predict  the  relative  performance  of  the  two  algorithms  in  the 
unknown  variance  case  also. 

5.1.3  Simulation  Study 

Consider  the  OWMM  for  illustration.  The  OWMM  is  a very  simple  example 


of  a GLMM,  in  which  the  MLE  of  /r  is  known,  making  the  use  of  computationally 
intensive  methods  like  MCEM  and  SML  unnecessary.  However,  because  of  its 
simplicity,  this  model  is  very  well  suited  to  illustrate  our  results.  Moreover,  a 
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method  failing  in  this  simple  model  is  unlikely  to  perform  well  in  more  complicated 
models. 

Consider  Figure  5.1.  For  5 effects  variance  components,  of  E {0.05;  0.5;  1;  2;  3}, 
5 different  sets  of  data  were  simulated  from  the  OWMM  (with  fixed  parameters 
m = 10,r  = 10,//  = 0 and  of  = 1).  The  parameter  n was  estimated  repeatedly 
in  each  data  set,  50  times  with  MCEM  (using  50  iterations  and  a constant  Monte 
Carlo  sample  size  of  M = 10  per  iteration)  and  50  times  with  SML  (using 
M = 500,  making  the  total  simulation  amount  of  both  methods  equal).  Box-plots 
of  the  estimates  for  n are  shown  for  (a)  MCEM  and  (b)  SML. 

Figure  5.1  displays  the  variability  of  MCEM  and  SML  for  increasing  values 
of  of.  The  variability  of  both  methods  is  very  small  when  of  is  almost  zero 
(erf  = 0.05).  This  is  the  case  where  the  mixed  model  essentially  reduces  to  a fixed 
effects  model  and  both  methods  perform  very  well. 

On  the  other  hand,  as  of  increases,  the  variability  of  MCEM  initially  grows, 
but  remains  almost  constant  for  of  > 1.  Conversely,  the  variability  of  SML 
continues  to  increase  with  every  increase  in  of  illustrating  that  the  variability  of 
SML  is  unbounded  relative  to  that  of  MCEM. 

5.2  Two  Examples 

The  following  two  examples  support  the  analytical  results  from  Section  5.1.  In 
addition,  they  also  provide  evidence  that  these  results  qualitatively  extend  to  the 
more  general  case  of  estimating  the  complete  parameter  vector  xp  = (/ 3,  a 2). 

The  example  in  Section  5.2.1  is  based  on  the  Mississippi  River  data  presented 
in  Littell  et  al.  (1996,  Ch.4.2).  In  Section  5.2.2  we  consider  a data  set  generated 
according  to  a model  that  McCulloch  (1997)  used  to  illustrate  MCEM  and  SML. 


Table  5.1:  Nitrogen  Concentrations  (in  parts/million)  at  six  randomly  selected 
influents  to  the  Mississippi  River. 
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Influent  1 

Influent2 

Influent3 

Influent4 

Influent5 

Influent6 

21 

21 

20 

14 

7 

41 

27 

11 

19 

24 

15 

42 

29 

18 

20 

30 

18 

35 

17 

9 

11 

21 

4 

34 

19 

13 

14 

31 

28 

30 

12 

23 

27 

29 

2 

20 

20 


Table  5.2:  Relative  Efficiency  of  MCEM  and  SML  for  Mississippi  data. 


Average 

Empirical  Variance 

c2 

„2 

Sa\ 

MCEM 

21.080 

51.294 

0.0051 

0.2205 

SML 

21.225 

51.624 

1.6710 

42.192 

Relative  Efficiency 

p 327.65 

191.35 

5.2.1  Mississippi  River  Data 

Table  5.1  presents  data  on  nitrogen  concentrations  from  several  sites  at  six 
randomly  selected  influents  to  the  Mississippi  River.  Littell  et  al.  (1996)  propose 
an  unbalanced  version  of  the  OWMM  in  equation  (2.5)  to  fit  this  data.  To  be  more 
specific,  let  r*  be  the  number  of  observations  at  influent  z,  (z  = 1, . . . , 6),  then  they 
fit 

Vij  = n + Ut  + Cij,  j = 1, . . . , n;  * = 1,  — ,6,  (5.8) 

where  the  zq’s  and  ei:;’s  are  random  samples  from  iV(0,  of)  and  N(0,ctq),  respec- 
tively. Littell  et  al.  obtain  the  MLEs  /z  = 21.22,  &l  = 42.7  and  o\  — 51.25. 

In  order  to  measure  the  efficiency  of  MCEM  and  SML,  we  applied  each 
method  to  the  data  200  times,  estimating  /z  and  of  (and  treating  of  as  fixed 
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and  known).  In  each  replicate,  MCEM  was  run  for  20  iterations  with  a constant 
Monte  Carlo  sample  size  of  M — 1000.  20  iterations  were  easily  enough  for  the 
deterministic  EM  algorithm  to  converge.  On  the  other  hand,  in  each  replication 
SML  (using  an  importance  sampling  distribution  with  of  = of)  was  applied  with 
M — 20,  000,  making  the  total  simulation  amount  for  both  methods  equal.  Table 

5.2  presents  the  results.  The  average  value  of  fi  over  the  200  replications  is  denoted 
p and  the  variance  of  those  200  values  of  p is  denoted  s2.  The  corresponding 
quantities  for  of  are  denoted  by  of  and  s2?  • 

Table  5.2  has  some  interesting  features.  Notice  first,  that  for  each  of  the  two 
methods  the  variances  for  of  are  much  larger  than  for  p.  This  indicates  additional 
difficulties  when  estimating  variance  components.  Furthermore,  for  each  parameter 
we  have  computed  the  relative  efficiency,  p,  defined  as  the  ratio  of  the  SML  and 
MCEM  variances.  With  the  total  simulation  amount  (M  = 20,  000)  equal  for  both 
methods,  the  efficiency  gain  of  MCEM  ranges  between  p = 327.65  and  p = 191.35 
(for  p and  of,  respectively).  This  means  that  MCEM  is  at  least  191  times  more 
efficient  in  using  the  total  simulated  data  than  SML. 

5.2.2  McCulloch’s  Model 

McCulloch  (1997)  considers  a logistic-normal  model,  similar  to  the  one 
described  in  Section  2.1.1.  Specifically,  suppose  that  the  y ^ are  conditionally 
independent  with 


The  Ui  are  assumed  to  be  a random  sample  from  the  N(0,a2)  distribution.  McCul- 
loch used  data  that  were  simulated  according  to  this  model  with  m = 10,  r = 15, 


VijWi  Bernoulli(Trjj) 


for  i = 1, ...,  m and  j — 1, ...,  r,  where 
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P = 5,  a2  = .5  and  Xij  = j / 15,  but  did  not  report  the  data.  Booth  and  Hobert 
(1999)  generated  data,  displayed  in  Table  5.3,  using  the  same  settings,  and  found 
the  exact  MLE  to  be  0,  a2)  = (6.132, 1.766),  based  on  numerical  integration.  We 
refer  to  this  data  set  as  the  “original  data”.  We  also  generated  a second,  new  set 
of  data,  displayed  in  Table  5.4,  with  the  same  settings,  which  resulted  in  a MLE  of 
0,  a2)  — (3.526,0.270).  We  denote  this  data  set  as  the  “new  data”. 

As  in  Section  5.2.1,  for  each  data  set  we  estimated  ft  and  a2  repeatedly  (200 
times)  using  each  of  the  two  methods,  MCEM  (20  iterations,  M = 1000)  and  SML 
( M = 20,000),  resulting  in  an  equal  total  simulation  amount  for  both  methods. 

Consider  Tables  5.5  and  5.6.  Notice,  that  for  the  original  data  the  efficiency 
gain  of  MCEM  is  at  least  p = 15.25.  However,  for  the  new  data,  the  two  methods 
are  almost  equally  efficient.  This  illustrates,  that  for  a small  variance  component 
(d2  = 0.270  for  the  new  data),  SML  performs  very  well  relative  to  MCEM. 
However,  as  the  effects  variance  component  increases,  the  efficiency  of  SML  rapidly 
declines. 

5.2.3  Practical  Limitations  of  SML 

In  the  context  of  these  two  examples,  we  want  to  point  out  some  practical 
limitations  and  difficulties  associated  with  SML.  They  apply,  however,  to  all 
examples  that  we  have  considered  in  the  framework  of  GLMMs. 

We  have  found  that  SML  is  in  practice  hard  to  implement.  One  reason  is  that 
it  requires  a numerical  method  such  as  Newton-Raphson  to  find  the  maximum 
of  the  simulated  likelihood  function.  However,  numerical  maximization  routines 
(and  in  particular  Newton-Raphson)  often  suffer  from  the  need  for  good  starting 
values.  In  our  examples,  we  found  it  necessary  to  use  starting  values  very  close  to 
the  true  MLEs.  Using  the  same  starting  values  as  for  MCEM,  which  were  rather 
arbitrarily  set  to  (/A°\  af^)  = (0, 1),  always  lead  to  convergence  problems  for  SML. 
This  is  a serious  disadvantage  for  SML,  as  it  requires  some  prior  knowledge  about 
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Table  5.3:  Original  data  according  to  McCulloch’s  model 


i 

Values  for  the  following  values  of  j 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

15 

1 

1 

0 

0 

0 

0 

1 

1 

0 

1 

1 

1 

1 

1 

1 

1 

2 

0 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

3 

0 

1 

0 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

4 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

5 

0 

1 

1 

1 

1 

1 

1 

1 

1 

1 

0 

1 

1 

1 

1 

6 

0 

0 

0 

1 

0 

1 

1 

1 

0 

1 

1 

1 

1 

1 

1 

7 

0 

1 

0 

0 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

8 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

9 

1 

0 

0 

1 

1 

0 

1 

1 

1 

1 

1 

1 

1 

1 

1 

10 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

Table  5.4: 

New  data  according  to  McCulloch’s  model 

i 

Values  for  the  following  values  of  j 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

15 

1 

1 

1 

0 

1 

1 

1 

1 

1 

0 

1 

1 

1 

1 

1 

1 

2 

1 

1 

0 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

3 

1 

0 

1 

1 

1 

1 

1 

1 

1 

1 

0 

1 

1 

1 

1 

4 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

5 

1 

0 

1 

1 

1 

1 

0 

1 

1 

1 

1 

1 

1 

1 

1 

6 

1 

1 

0 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

7 

1 

1 

1 

1 

0 

1 

1 

1 

1 

1 

1 

1 

1 

1 

0 

8 

0 

1 

1 

0 

1 

1 

1 

1 

1 

1 

1 

1 

0 

1 

1 

9 

0 

0 

1 

1 

0 

0 

0 

1 

1 

1 

1 

0 

1 

1 

1 

10 

1 

0 

0 

0 

1 

1 

1 

0 

1 

1 

1 

1 

1 

1 

1 
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Table  5.5:  Relative  Efficiency  of  MCEM  and  SML  for  McCulloch’s  model  (1) 


Original  Data  (/?,  a1 

')  = (6.132,1.766) 

Average 

Empirical  Variance 

P <72 

o2 

sp 

o2 

MCEM  6.1200  1.7531 

0.0065 

0.0100 

SML  6.0975  1.6988 

0.2325 

0.1525 

Relative  Efficiency  p 

35.769 

15.250 

Table  5.6:  Relative  Efficiency  of  MCEM  and  SML  for  McCulloch’s  model  (2) 


New  Data  (p,a2)  = (3.526,0.270) 


Average  Empirical  Variance 


P 

a2 

e2 

SP 

s2 

MCEM 

3.5310 

0.3094 

0.0008 

0.0008 

SML 

3.5267 

0.2470 

0.0021 

0.0011 

Relative  Efficiency 

p 2.6250 

1.3750 
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the  solution  at  the  outset.  On  the  other  hand,  using  “trial-and-error”  to  find  good 
starting  values  can  be  very  time  consuming  and  is  increasingly  difficult  in  higher 
dimensional  problems.  Conversely,  MCEM  is  very  stable  and  generally  converges 
even  with  starting  values  chosen  randomly  in  the  parameter  space. 

5.2.4  More  Efficient  Use  of  MCEM 

We  also  want  to  emphasize  that  we  have  not  used  MCEM  in  its  most  efficient 
way  in  our  examples.  For  simplification,  we  used  MCEM  with  a constant  Monte 
Carlo  sample  size  M in  each  iteration.  However,  it  is  generally  inefficient  to  use 
MCEM  with  a constant  M in  every  iteration.  Using  smaller  sample  sizes  in  earlier 
iterations  and  increasing  the  sample  sizes  as  the  algorithm  moves  along  leads  to 
a more  efficient  use  of  the  total  simulation  amount.  This  means  that  the  Monte 
Carlo  error  in  the  last  iteration  will  be  significantly  smaller  than  in  our  examples, 
making  its  superiority  over  SML  even  more  apparent. 


CHAPTER  6 

EFFICIENCY  IMPROVEMENT  WITH  QUASI-MONTE  CARLO 


So  far  we  have  considered  stochastic  estimation  based  on  classical  Monte 
Carlo  methods;  that  is,  based  on  methods  that  use  random  points  to  evaluate  an 
intractable  integral.  In  this  chapter  we  investigate  methods  using  non-random 
points.  Specifically,  we  consider  methods  that  are  based  on  the  ideas  of  Monte 
Carlo  and  which  use  deterministic  sequences  of  points;  in  particular,  sequences  of 
points  which  are  more  uniformly  spread  in  the  sampling  space  than  random  points. 
The  advantage  of  using  these  deterministic  sequences  is  that  the  corresponding 
estimators  often  have  a smaller  variance  which  results  in  a more  efficient  use  of 
the  simulated  data.  Methods  that  are  based  on  deterministic  sequences  are  often 
classified  as  Quasi-Monte  Carlo  methods. 

Quasi-Monte  Carlo  methods  are  relatively  unpopular  in  statistics.  Although 
there  exist  a variety  of  introductions  and  review  articles  of  these  methods  (see, 
e.g.,  Shaw,  1988;  Fang  and  Wang,  1994;  Fang  et  ah,  1994;  Owen,  1998),  the  ideas 
of  Quasi-Monte  Carlo  are  not  very  often  applied  to  practical  statistical  problems. 
Indeed,  we  found  relatively  few  articles  that  make  use  of  Quasi-Monte  Carlo 
(Niederreiter  and  Peart,  1986;  Tezuka  and  Fushimi,  1992;  Carletti  et  ah,  1994; 
Ostland  and  Yu,  1997;  Pan  and  Thompson,  1998;  Liao,  1998).  This  is  somewhat 
surprising  since  this  method  has  been  used  very  successfully  in  other  fields,  for 
example  in  finance  for  the  evaluation  of  high  dimensional  integrals  occurring  in 
financial  derivatives  (see  Paskov  and  Traub,  1995;  Paskov,  1997;  Morokoff  and 
Caflisch,  1998). 
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In  Section  6.1  we  introduce  the  basic  concepts  of  Quasi-Monte  Carlo,  show 
how  to  construct  deterministic  sequences  that  are  more  uniformly  spread  than 
random  points  and  point  out  advantages  over  classical  Monte  Carlo  methods. 

In  Section  6.2  we  apply  Quasi-Monte  Carlo  to  the  GHMs  and  investigate  how 
Quasi-Monte  Carlo  can  be  used  to  improve  the  efficiency  of  SML. 

6.1  Quasi-Monte  Carlo  Integration 
Suppose  we  want  to  evaluate  an  (analytically  intractable)  integral 

1=  [ f(x)dx  (6.1) 

Jcd 

over  the  d-dimensional  unit  cube , Cd  = [0,  l]d.  If  d is  small,  then  the  integral  in 
(6.1)  is  typically  evaluated  using  standard  (deterministic)  quadrature  methods; 
however,  quadrature  is  no  longer  recommended  when  d is  large.  When  d is  large, 
(6.1)  is  often  evaluated  using  Monte  Carlo  integration. 

For  a set  of  points,  {x1?  C Cd,  classical  Monte  Carlo  integration 

approximates  (6.1)  by  the  empirical  average 

1 M 

(6-2) 

t=l 

where  the  points  x,  are  selected  randomly  in  the  unit  cube,  that  is,  x*  ~ 

UnifQO,  l]d). 

One  criticism  of  classical  Monte  Carlo  integration  is  that  its  ideas  are  based 
on  the  ability  to  produce  random  points  x,.  However,  in  reality  random  points  are 
typically  not  available.  In  fact,  in  most  cases,  the  points  x,  are  produced  with  the 
help  of  a random  number  generator;  but  since  random  number  generators  are  based 
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Figure  6.1:  100  random  points  in  the  unit  square. 

on  deterministic  algorithms  1 , these  points  cannot  be  considered  genuinely  random 
and  are  therefore  often  referred  to  as  “pseudo” -random  points. 

Another  criticism  is  that  random  numbers  typically  do  not  explore  the  sample 
space  very  well.  Consider  Figure  6.1,  for  instance,  which  shows  100  random  points 
of  the  unit  square,  x*  ~ Unif([0,  l]2),  i = 1, . . . , 100.  Random  points  tend  to  form 
clusters,  “over-sampling”  the  unit  square  in  some  places;  this  leads  to  gaps  in  other 
places,  where  the  sample  space  is  not  explored  at  all. 

Quasi-Monte  Carlo  methods  avoid  these  points  of  criticism.  Although  the  ba- 
sic ideas  of  Monte  Carlo  and  Quasi-Monte  Carlo  integration  are  very  similar,  there 
is  one  important  difference:  instead  of  using  random  points,  Quasi-Monte  Carlo 


1 See,  for  example,  Robert  and  Casella  (1999,  Chapter  2)  for  a more  thorough 

discussion  of  this  issue. 
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Figure  6.2:  100  points  of  a low  discrepancy  sequence  in  the  unit  square. 

uses  a deterministic  sequence  of  points  which  explores  the  sample  space  better  than 
random  points.  These  sequences  are  often  referred  to  as  low  discrepancy  sequences. 

6.1.1  Low  Discrepancy  Sequences 

Quasi-Monte  Carlo  uses  deterministic  sequences  of  points  which  provide 
a better  spread,  or  “uniformity”,  in  the  sampling  space,  avoiding  the  gaps  and 
clusters  that  arise  from  random  sampling.  In  order  to  quantify  the  notion  of 
uniformity,  one  defines  a distance  measure,  the  discrepancy.  Several  different 
distance  measures  exist;  the  most  widely  studied  measure  is  the  star  discrepancy 
which  is  defined  as  follows  (see,  e.g.,  Fang  et  al.,  1994;  Morokoff  and  Caflisch, 

1995). 

Let  R be  a rectangle  contained  in  Cd  with  sides  parallel  to  the  coordinate  axes, 
and  let  A (R)  denote  the  Lebesgue  measure  of  R.  The  star  discrepancy,  D*M,  for  the 
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sequence  {xi, . . . , xM}  is 


D*m  = sup 

RCCd 


# of  points  Xj  in  R 


M 


-MR) 


(6.3) 


Quasi-Monte  Carlo  methods  use  sequences  of  points  with  the  smallest  possible 
discrepancy.  One  can  show  that  for  a set  of  random  points  the  star  discrepancy 
has  asymptotic  order  of  0P((loglogM)/M1/2).  On  the  other  hand,  there  exist 
many  deterministic  sequences  that  have  smaller  asymptotic  order.  The  star  dis- 
crepancy of  the  best-known  deterministic  sequences  have  asymptotic  order  of 
0((logM)d/M);  these  sequences  are  called  low  discrepancy  sequences.  The  asymp- 
totic order  suggests  that  low  discrepancy  sequences  have  greater  uniformity  than 
random  sequences.  Notice,  however,  that  for  large  d it  could  take  (impractically) 
large  sample  sizes,  M,  before  the  asymptotics  are  relevant.  However,  empirical 
studies  suggest  that  Quasi-Monte  Carlo  can  be  more  accurate  than  Monte  Carlo 
on  some  real  problems  with  practical  sample  sizes  (see,  e.g.,  Morokoff  and  Caflisch, 
1995). 


Figure  6.2  shows  100  points  of  a low  discrepancy  sequence  in  the  unit  square. 
Compared  to  Figure  6.1,  these  points  avoid  clustering  and  provide  a more  uniform 
spread  than  random  points. 

There  exist  many  different  low  discrepancy  sequences.  Examples  include 
the  Halton  sequence  (Halton,  1960),  the  Sobol  sequence  (Sobol,  1967),  the  Faure 
sequence  (Faure,  1982),  and  the  Niederreiter  sequence  (Niederreiter,  1992).  In  this 
work  we  will  focus  on  the  Halton  sequence  only. 


6.1.2  Halton  Sequences 

Let  6,  b > 2,  be  an  integer.  Then  any  integer  n,  n > 0,  can  be  written  in  base-5 
representation  as 

n = djV  -T  dj—iV  * + • • • -f-  d\b  + do, 
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where  d,  £ {0, 1, . . . , b — 1}  for  i — 0, 1, . . . , j.  For  example,  if  b = 7,  we  can  write 
the  integer  n = 101  as  n = 2 • 72  + 3 ■ 7°. 

The  radical  inverse  function  (j>b(n)  of  n to  the  base  b is  defined  as 


For  example,  for  the  base  b — 7 and  n = 101,  the  radical  inverse  function  applied  to 
n gives  07(1O1)  = 3/7  + 2/73.  Notice  that  <j>b(-)  maps  every  integer  onto  an  element 
of  the  unit  interval;  that  is,  for  every  integer,  n > 0,  4 i6(n ) £ [0, 1]. 

The  radical  inverse  function  defines  the  elements  of  a Halton  sequence.  If 
we  let  bi, ...,  bd  be  d different  prime  integers  that  are  greater  than  one,  then  a d- 
dimensional  Halton  sequence  is  given  by  {x0, ...,  xM_i}  C Cd,  where  the  elements 
xfc  are  defined  by 


In  practice,  the  integers  bi, ...,  bd  are  often  chosen  to  be  the  first  d prime  numbers. 

Notice  that  we  do  not  have  to  start  the  sequence  (6.4)  at  the  origin,  k — 0.  For 
any  d-vector  of  integers,  say  m = (mi, ...,  md)',  nii  > 0,  the  sequence  defined  by 


is  still  a low  discrepancy  sequence  (see,  e.g.,  Pages,  1992;  Bouleau  and  Lepingle, 
1994). 

6.1.3  Approximation  Error  of  Quasi-Monte  Carlo 

One  disadvantage  of  Quasi-Monte  Carlo  methods  is  that  estimation  of  the  ap- 
proximation error  is  typically  very  complicated.  An  error  bound  for  approximations 
of  the  form  (6.2)  is  given  by  the  Koksma-Hlawka  inequality  (see,  e.g.,  Fang  and 


Xjt  = (<Mfc),-,<MA;))',£  = 0, 1,...,M-  1. 


(6.4) 


Xfc  = (Mm  + k ),  ....^(TOd  + k))',k  = 0, 1, ...,  M — 1 


(6.5) 
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Wang,  1994) 


I -I  <V(f)D[ 


Mi 


(6.6) 


where  V(f)  denotes  the  total  variation  of  / in  the  sense  of  Hardy  and  Krause  (see 
Niederreiter,  1978,  p.966).  The  error  bound  in  (6.6)  has  only  limited  usefulness, 
since  V(f)  is  difficult  to  estimate.  Classical  Monte  Carlo  methods  typically 
estimate  the  error  in  (6.6)  by  techniques  based  on  the  Strong  Law  of  Large 
Numbers  and  the  Central  Limit  Theorem.  These  techniques,  however,  require  the 
sequence  of  points,  {xx,  ...,xM},  to  be  random.  Since  Quasi-Monte  Carlo  methods 
are  based  on  deterministic  sequences,  these  techniques  do  not  apply.  This  drawback 
has  lead  to  the  development  of  randomized  Quasi- Monte  Carlo  integration. 

6.1.4  Randomized  Quasi-Monte  Carlo 

Several  authors  have  suggested  using  randomized  low  discrepancy  sequences 
(Shaw,  1988;  Owen,  1998;  Wang  and  Hickernell,  2001).  Owen  (1998)  proposes  that 
a randomized  low  discrepancy  sequence,  {xi, . . . ,xM},  should  have  the  following 
properties: 

1.  Every  element  of  the  sequence  is  a random  point  of  the  unit  cube;  that  is, 


x,  ~ Unif([0,  l]d),  Vi. 

2.  The  sequence  {xi, . . . ,xM}  is  a low  discrepancy  sequence  with  probability 


one. 

Property  1 makes  the  estimator  in  equation  (6.2)  an  unbiased  estimate  of  I and 
property  2 preserves  the  uniformity  of  the  sequence  of  points. 

For  R independent  randomized  low  discrepancy  sequences  let  /(H, . . . , I 
be  the  corresponding  estimates  based  on  equation  (6.2).  The  variance  of  the 
(randomized)  Quasi-Monte  Carlo  estimate  can  then  be  estimated  by 


(6.7) 
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where  7 = J2f=i  I^/R-  One  can  thus  use  equation  (6.7)  to  estimate  the  approxi- 
mation error  in  (6.6). 


6.1.5  Randomized  Halton  Sequences 

We  mentioned  in  Section  6.1.2  that  the  Halton  sequence  in  (6.5)  is  a low 
discrepancy  sequence  regardless  of  the  starting  point.  Wang  and  Hickernell  (2001) 
use  this  fact  to  develop  Halton  sequences  with  random  starting  points. 

For  simplicity,  let  us  consider  the  univariate  case  only,  d = 1.  The  extension  to 
the  multivariate  case  is  straightforward.  Let  xq  be  a randomly  chosen  point  in  the 
unit  interval,  xq  ~ Unif([0, 1]),  and  let  5 be  a prime  integer.  Since  the  construction 
of  the  Halton  sequence  is  based  on  the  radical  inverse  function,  which  operates  on 
the  integers,  we  need  to  find  a suitable  integer-representation  for  x0.  Notice  that  we 
can  write  x0  in  base-5  notation  as 


say,  for  dj  € {0, 1, ...,  5 — 1}.  The  integer  J is  typically  a large  number  and  chosen  to 
satisfy  a predefined  accuracy  (which  could  be,  for  example,  higher  than  the  actual 
computer  accuracy).  The  base-5  integer- representation  of  the  starting  point  x0  is 
then 


Therefore,  since  xq  is  random,  so  is  the  integer  m.  The  randomized  (univariate) 
Halton  sequence,  (x0,  ...,Xm-i},  starts  at  the  (random)  integer  m and  computes 
successive  elements  according  to 


m = m(x0)  = djbJ  + ...  + d0b°  = <j)h  x(i0). 


Xk  = (/>b(m  + k),  k = 0, . . . , M — 1. 


(6.8) 


Wang  and  Hickernell  (2001)  show  that  if  the  starting  point  of  the  Halton 
sequence  is  random,  then  every  consecutive  point  of  the  sequence  is  random  also, 
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£0  ~ Unif([0, 1])  =>  (j)b(m(x 0)  + k)  ~ Unif([0,  l]),k  = 1,  2, 3, ...  . (6.9) 


Therefore,  the  randomized  Halton  sequence  satisfies  properties  1 and  2 of  Section 


To  this  point  we  have  introduced  Quasi-Monte  Carlo  methods  and  outlined 
how  to  approximate  integrals  over  the  unit  cube  with  the  help  of  these  methods. 
In  this  section  we  show  how  Quasi-Monte  Carlo  can  be  used  to  approximate  the 
intractable  likelihood  function  in  (2.6).  Recall  that  the  likelihood  involves  an 
integral  of  the  form 


where  71  denotes  the  range  of  integration.  Except  for  few  special  cases,  the  range 
of  integration  in  (6.10)  will  be  different  from  the  unit  cube,  Cd.  Therefore,  Quasi- 
Monte  Carlo  methods  can  typically  not  be  applied  directly  to  the  integral  in 
(6.10). 

In  order  to  apply  Quasi-Monte  Carlo  to  (6.10)  one  has  to  find  a suitable 
transformation  from  TZ  into  the  unit  cube.  Clearly,  such  a transformation  will 
only  exist  for  few  special  cases  of  the  GHM.  The  GLMM,  by  its  very  nature,  is  an 
example  of  a GHM  for  which  such  a transformation  often  exists. 

6.2.1  Probability  Integral  Transformation  for  GLMMs 

In  a GLMM  one  typically  assumes  normality  for  the  random  effects,  that  is, 
one  assumes  that  <7(11;  t/>2)  ~ iV(0,  G).  If  we  assume  that,  in  addition,  the  random 
effects  are  independent,  that  is,  if  we  assume  that  G is  a diagonal  matrix  with 


6.1.4. 


6.2  Quasi-Monte  Carlo  methods  in  GLMMs 


(6.10) 
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diagonal  elements  of , i = 1, . . . , q,  then  we  can  write  equation  (6.10)  as 

L(^\y)  = [ f(y\ui,  ...,uq-  xff)  Yl  duu  duq,  (6.11) 


i= 1 


where  ip  denotes  the  pdf  of  the  standard  normal  distribution.  Using  the  probability 
integral  transformation,  v{  = where  $(x)  denotes  the  cdf  of  the  standard 

normal  distribution,  equation  (6.11)  becomes 


LW> W)=  /(y|$  1(o-i^i), . . . , ^ 1(aqvt)\^l)  dv1,...,dvq.  (6.12) 

J Ci 

Equation  (6.12)  makes  the  application  of  Quasi-Monte  Carlo  straightforward. 

We  point  out  that  only  few  practical  problems  will  allow  a representation  of 
the  form  (6.12).  For  instance,  if  the  dependence  structure  of  the  random  effects  is 
more  complex  (than  the  assumed  independence  above),  then  this  method  will  not 
work.  Although  there  exist  further  useful  transformation  methods  (see  Robert  and 
Casella,  1999,  Chapter  2.2),  the  practitioner  may  often  encounter  situations  where 
a transformation  into  the  unit  cube  does  not  exist. 


6.2.2  Efficiency  of  SML  using  Quasi-Monte  Carlo 

In  the  following  two  examples  we  investigate  how  the  efficiency  of  SML  can  be 
improved  by  using  Quasi-Monte  Carlo  sampling  instead  of  classical  Monte  Carlo. 

OWMM.  Consider  the  OWMM  introduced  in  Section  2.1.3  and  assume  that 
the  variance  components,  of  and  of,  are  known  and  that  we  are  only  interested  in 
estimating  the  general  mean  p. 

For  each  of  the  following  eight  effects  variance  components,  of  € {0.05;  0.1;  0.2;  0.4; 
0.6;  0.8;  1;  1.5},  we  simulated  20  sets  of  data  from  the  OWMM  (with  p and  of  held 
fixed  at  0 and  1,  respectively).  For  each  set  of  data,  we  estimated  p 50  times  us- 
ing SML  with  each  of  the  two  sampling  methods  (and  fixed  Monte  Carlo  sample 
size  M).  For  each  value  of  of,  we  computed  the  mean  squared  errors  for  the  esti- 
mates based  on  classical  Monte  Carlo  (say,  MSEmc)  and  randomized  Quasi-Monte 
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Figure  6.3:  Relative  Mean  Squared  Errors  of  Quasi-Monte  Carlo  for  OWMM. 

Carlo  (say,  MSEqmc).  We  repeated  the  experiment  for  four  different  values  of 
M,  M € {10;  50;  100;  200}.  Figure  6.3  shows  plots  of  the  relative  mean  squared 
errors,  MSEmc/MSEqmc,  as  a function  of  o\  for  M = 10  (line  1),  M — 50  (line  2), 
M = 100  (line  3)  and  M = 200  (line  4).  The  dotted  line  in  Figure  6.3  represents 
equal  mean  squared  errors,  that  is,  MSEmc  = MSEqmc. 

Observe  that  the  relative  mean  squared  errors  increase  as  a\  decreases;  that 
is,  the  smaller  of,  the  higher  the  gain  from  using  randomized  Quasi-Monte  Carlo 
over  classical  Monte  Carlo.  Moreover,  this  gain  becomes  larger  for  larger  Monte 
Carlo  sample  sizes.  This  implies  that  the  advantage  from  using  more  uniformly 
distributed  low  discrepancy  sequences  increases  the  more  points  we  sample. 
Classical  Monte  Carlo  performs  almost  as  well  when  we  sample  only  few  points. 
However,  with  increasing  M,  Quasi-Monte  Carlo  methods  gain  over  classical  Monte 
Carlo,  since  they  avoid  clusters  and  explore  the  sample  space  more  thoroughly. 


0.001 
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Figure  6.4:  Integrand  of  likelihood  in  (6.12)  for  a\  =0.1 


Figure  6.5:  Integrand  of  likelihood  in  (6.12)  for  <j\  = 2 
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Another  interesting  feature  in  Figure  6.3  is  that  the  gain  in  using  Quasi- 
Monte  Carlo  decreases  as  the  effects  variance  component  a\  increases.  Morokoff 
and  Caflisch  (1995,  p.218)  found  in  empirical  studies  that  “Quasi-Monte  Carlo 
is  superior  to  classical  Monte  Carlo,  but  the  advantage  may  be  slight  [...]  for 
integrands  that  are  not  smooth”.  Figure  6.4  shows  the  plot  of  the  likelihood- 
integrand  in  (6.12)  for  a two-dimensional  random  effect,  u £ 7 Z2,  with  o\  — 0.1.  We 
can  see  that  the  integrand  is  a very  smooth  curve  in  the  unit  square.  Conversely, 
Figure  6.5  shows  a plot  of  the  integrand  for  a large  variance  component  (erf  — 2). 
Clearly,  for  this  variance  component  the  integrand  is  very  non-smooth  and  spiky. 

Logistic-Normal  Model.  For  the  logistic-normal  model  described  in  Section 
5.2.2  (using  the  “original”  data  with  MLEs  ft  = 6.132  and  a2  — 1.766)  we 
conducted  a simulation  study  similar  to  the  previous  example.  We  estimated 
both  parameters  repeatedly,  100  times  using  SML  with  classical  Monte  Carlo  and 
100  times  using  SML  with  randomized  Quasi-Monte  Carlo.  We  performed  the 
experiment  two  times,  with  M = 1,000  in  the  first  run  and  M - 5,000  in  the 
second  run.  Table  6.1  shows  the  bias  and  mean  squared  error  for  the  estimates 
based  on  Monte  Carlo  sampling  (MC)  as  well  as  randomized  Quasi-Monte  Carlo 
(QMC)  and  their  ratios  (MC/QMC). 

We  can  see  that  SML  based  on  randomized  Quasi-Monte  Carlo  outperforms 
classical  Monte  Carlo  in  terms  of  the  bias  as  well  as  the  mean  squared  error 
for  both  parameter  components.  Although  the  efficiency  gain  may  be  small,  it 
increases  as  the  Monte  Carlo  sample  size  increases.  Since  the  variance  component, 
a2  = 1.766,  can  be  considered  rather  large  (implying  a potentially  very  non-smooth 
likelihood  integrand),  we  could  expect  an  even  larger  gain  for  smaller  values  of  a2. 

6.3  Conclusion 

Quasi-Monte  Carlo  is  an  appealing  alternative  to  classical  Monte  Carlo. 

The  concept  of  deterministic  sequences  that  explore  the  sampling  space  more 
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Table  6.1:  Relative  Efficiency  of  Quasi-Monte  Carlo  for  McCulloch’s  model 


Original  Data  0, 

a2)  = (6.132,1.766) 

P 

a2 

MC 

QMC 

MC/QMC 

MC 

QMC 

MC/QMC 

Bias 

0.01236 

0.01205 

1.02573 

-0.13340 

-0.04561 

2.92480 

MSE 

0.14425 

0.10013 

1.44063 

0.43418 

0.27155 

1.59890 

(Monte  Carlo  Sample  Size  M = 1000) 

Bias 

0.01694 

0.00113 

14.9912 

-0.03817 

-0.00627 

6.08772 

MSE 

0.05802 

0.02062 

2.81377 

0.33499 

0.14184 

2.36174 

(Monte  Carlo  Sample  Size  M = 5000) 

thoroughly  than  random  points  has  great  theoretical  appeal  and  has  certainly  also 
some  practical  value  as  pointed  out  in  the  previous  examples. 

However,  application  of  Quasi-Monte  Carlo  to  GHMs  is  limited.  The  need  for 
a suitable  transformation  into  the  unit  cube  makes  Quasi-Monte  Carlo  applicable 
only  in  few  special  cases.  For  SML,  when  the  distribution  of  the  random  effects  is 
normal  (or  also  for  other  “standard”  distributions,  like  exponential,  Poisson,  beta 
or  gamma),  such  a transformation  may  often  be  available.  However,  for  methods 
like  MCEM,  that  require  sampling  from  an  often  very  complicated,  conditional 
distribution,  this  may  not  be  the  case. 

Moreover,  the  OWMM  example  has  shown  that  the  gain  in  using  Quasi- 
Monte  Carlo  integration  may  be  very  small  for  situations  with  very  non-smooth 
integrands,  like  the  SML-integrand  for  large  effects  variance  components.  However, 
in  the  context  of  mixed  models  situations  with  larger  variance  components  are 
certainly  of  greater  interest,  since,  if  the  effects  variance  components  are  small,  the 
mixed  model  can  often  be  approximated  sufficiently  well  by  a fixed  effects  model 
(for  which  Monte  Carlo  methods  are  not  needed  at  all). 


CHAPTER  7 

EFFICIENCY  OF  STOCHASTIC  APPROXIMATION 


The  previous  two  chapters  focused  on  investigating  the  performance  of  SML 
relative  to  that  of  MCEM.  Now  we  turn  our  attention  to  stochastic  approximation 
procedures.  As  we  have  already  pointed  out,  stochastic  approximation  methods 
have  an  apparent  advantage  over  methods  like  MCEM,  since  they  reduce  the  error 
variance  without  increasing  the  Monte  Carlo  sample  size.  In  the  following  we 
compare  the  performance  of  SAEM  with  that  of  MCEM. 

7.1  Convergence  Rate  of  MCEM  and  SAEM 
Stochastic  algorithms,  like  MCEM  or  SAEM,  can  be  decomposed  into  two 
components;  a deterministic  component,  that  follows  the  underlying  deterministic 
version  of  the  algorithm,  and  a random  noise  component  due  to  Monte  Carlo  error. 
Just  as  the  deterministic  version  of  MCEM  is  the  EM  algorithm,  there  is  also  a 
deterministic  version  underlying  SAEM.  Replacing  Q by  Q (or  F by  F)  in  equation 
(4.32)  (or  (4.33)),  reveals  the  underlying  deterministic  version  of  SAEM. 

In  order  to  investigate  the  convergence  behavior  of  a stochastic  algorithm, 
both  of  these  components  have  to  be  examined.  The  noise  component  determines 
the  variability  of  the  parameter  update  about  the  “average”  update,  the  update 
of  the  underlying  deterministic  version.  This  variability  can,  at  least  in  principle, 
be  made  arbitrarily  small  by  simply  increasing  M,  the  Monte  Carlo  sample  size. 

In  the  limit,  as  M approaches  infinity,  all  noise  is  removed  from  the  system 
and  the  rate  at  which  the  algorithm  converges  is  entirely  determined  by  its 
deterministic  component.  In  the  following  we  investigate  the  convergence  rates 
of  the  deterministic  components  of  MCEM  and  SAEM. 
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Our  starting  point  is  the  approximate  relation  in  (3.17);  that  is,  in  a neighbor- 
hood of  the  MLE  the  EM  update  satisfies 

t/j(t+1>  — tjj  « B(t/>^  — i/>),  (7.1) 

where  ^(5x1)  and  B (5x5)  is  the  rate  matrix  defined  in  equation  (3.18). 
Equation  (7.1)  implies  that  the  rate  at  which  EM  converges  is  Bt+1,  since  — 

■0  « B4+1(i//0^  — -0). 

Often,  it  is  desirable  to  have  a univariate  measure  for  the  convergence  rate. 

Let  if)  — (ipi, . . . , ips)'  denote  the  parameter  components  and  bi, ...  ,bs  the 
eigenvalues  of  B.  It  follows  from  the  Spectral  Decomposition  Theorem  and 
equation  (7.1)  that  if  ipj,  j — 1, . . . , S,  is  associated  with  a large  bj,  then  this 
parameter  component  converges  at  a slow  rate  (and  vice  versa).  Therefore,  some 
authors  suggest  using  bmax,  the  largest  eigenvalue  of  B,  as  a measure  of  the 
convergence  rate  of  EM  (see  Dempster  et  ah,  1977;  Laird  et  ah,  1987;  Meng,  1994). 
Clearly,  bmax  measures  the  convergence  of  the  slowest  component  of  i/>.  However,  it 
does  not  provide  an  overall  measure  of  the  convergence  rate. 

On  the  other  hand,  recall  that  the  determinant  of  B can  be  expressed  as  the 
product  of  its  eigenvalues, 

5 

pi=nv  p-2) 

3= 1 

Since  |B|  is  a monotonically  increasing  symmetric  function  of  the  eigenvalues,  it 
provides  a good  measure  for  the  overall  convergence  rate  of  an  algorithm  1 . Thus, 


1 In  similar  fashion,  in  multivariate  analysis  where  S denotes  sample  covariance 

matrix,  one  often  uses  |S|  rather  than  the  largest  eigenvalues  of  S as  a univariate 
measure  of  the  overall  scatter  about  the  mean  (see  Mardia  et  ah,  1979). 
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EM’s  overall  convergence  rate  can  be  expressed  as 

s 


IB 


r=nr- 

j= i 


(7.3) 


Next  we  establish  that  all  eigenvalues  of  B lie  in  the  interval  [0, 1].  Let 
JG  = J(,0)  be  the  observed-data  information  matrix,  where  J (ip)  is  the  negative 
second  derivative  of  the  log-likelihood  of  -0, 

d 2 logL(V>|y) 


JW 


with  L(tf) |y)  defined  in  (2.6).  On  the  other  hand,  the  complete-data  information 
matrix  is  given  by 


(7.4) 


'j2  log  /( y,  u;  </>) 


Since  u is  unobserved,  one  averages  (7.5)  over  the  conditional  distribution  of  u 
given  y to  get 

d2  log/(y,u;^) 


(7.5) 


J,  = -E 


d^d^1 

S2Q(i>,W 


y 


(7.6) 


•0,  ,-0=V’ 


From  the  factorization 


/(y,  u;  t/j)  = L(,0|y)/(u|y;  ip),  (7.7) 

it  follows  that  the  log-likelihood  of  is 

logL(V>|y)  = log/(y,u;t/>)  - log/(u|y;  t/?).  (7.8) 

Equation  (7.8)  implies  that  after  taking  second  derivatives,  averaging  with  respect 
to  /(u|y;V>),  and  evaluating  at  -0  = we  get  the  fundamental  identity, 


Jo  — Jc  J»7l) 


(7.9) 
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which  is  often  found  useful  in  the  context  of  the  EM  algorithm.  In  equation  (7.9) 
we  defined  the  matrix 


-E 


d 2 log/(u|y;V>) 


y; 


xj)=xj) 


dxjidxp' 

which  can  be  viewed  as  the  missing  information.  Oakes  (1999)  shows  that 


(7.10) 


dxj>tdx p' 


(7.11) 


V,..V>=V> 


Equation  (7.9)  can  be  written  in  the  appealing  form 


observed  information  = complete  information  — missing  information, 


which  is  often  referred  to  as  the  missing  information  principle  (see,  e.g.,  Louis, 

1982;  Meng  and  Rubin,  1991). 

Dempster  et  al.  (1977)  as  well  as  Meng  and  Rubin  (1991).  show  that  for  the 
rate  matrix 

B = J.nJ;1  (7.12) 


(see,  e.g.,  Dempster  et  al.,  1977;  Meng  and  Rubin,  1991).  It  follows  from  equations 
(7.9)  and  (7.12)  that 

J0  = (I-B)JC.  (7.13) 


Notice  that  if  xp  is  a local,  if  not  global,  maximum  of  logL(-0|y),  then,  necessarily, 
J0  is  positive  (semi-)  definite.  Furthermore,  Dempster  et  al.  (1977,  Theorem  4) 
show  that  Jc  is  positive  definite.  It  follows  (see  Mardia  et  al.,  1979,  Corollary 
A. 7.3.1)  that  all  non-zero  eigenvalues  of  I — B are  positive;  that  is,  1 — bj  > 0,  where 
bj  denotes  the  jth  eigenvalue  of  B.  This  implies  that  all  eigenvalues  of  B are  real 
and  less  than  (or  equal  to)  one.  Furthermore,  since 


J m = Var 


/ <91og/(y,u;VQ 
V dxj> 


y;^ 


(7.14) 


X 
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(see  Louis,  1982),  Jm  is  also  positive  (semi-)  definite.  It  follows  from  equation 
(7.12)  that  all  non-zero  eigenvalues  of  B are  positive.  In  summary,  all  eigenvalues 
of  B lie  in  the  interval  [0, 1]. 

Consider  the  univariate  case  for  an  intuitive  explanation  of  this  result.  Let 
ijj  E 711  and  let  b be  the  rate  constant  satisfying  the  approximate  relation  in  (7.1). 
Then,  equation  (7.12)  implies  that 

missing  information  = b x complete  information. 

Since  the  missing  information  cannot  be  negative  and  since  it  cannot  exceed  the 
complete  information,  necessarily  b E [0, 1]. 

We  now  derive  the  deterministic  component  of  the  SAEM  algorithm.  Recall 
that  the  (t  + l)st  SAEM  update  satisfies  Ft+i(t/>)  = 0,  with  Ft+i  defined  in  (4.33). 
Consequently,  replacing  F by  F in  (4.33)  yields  the  estimating  equations  for  the 
deterministic  component  of  SAEM;  that  is,  the  (t  + l)st  update  of  the  deterministic 
SAEM  algorithm  solves  Gt+i(V0  = 0 where 

G1+iM>)  = (1  - T.JG.WO  +7«F(</-,V’W).  (7.15) 

An  iterative  argument  shows  that 

t 

GhiW  = E'^F».*<#).  (7'16) 

i=0 

where  wt<i  = 7in[j=i+i(1  “ 7j)  and  Y?i=Qwt,i  = 1 (by  induction).  Using  EM’s 
approximate  relation  in  equation  (7.1),  the  ( t -I-  l)st  update  of  the  deterministic 
SAEM  algorithm  can  then  be  written  as 

t 

t/>(t+1)  - i!>  ~ ^iet)jB(i//1)  - V?). 
i=0 


(7.17) 
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A standard  weighting  scheme  is  (see,  e.g.,  McCulloch  and  Searle,  2001) 

It  = j^~t,  t = 0,1,2,...,  (7.18) 

where  A and  B are  positive  integers.  Taking  A = B = 1,  it  follows  that  wtfi  — 

1 /{t  + 1).  Straightforward  calculations  show  that  equation  (7.17)  then  reduces  to 

•0(t+1)  — « Bf+i('0(o)  — ip),  (7.19) 

where  the  convergence  rate  of  deterministic  SAEM  is  determined  by  the  matrix 

S.«  = ri^-  (7-20) 

k=0 

In  the  following  we  compare  the  convergence  rates  of  EM  and  deterministic  SAEM. 
We  start  our  discussion  with  the  univariate  parameter  case. 


7.1.1  Univariate  Parameter  Case 

Assume  that  ^ G 77. 1 and  let  the  rate  constant  b satisfy  (7.1).  It  follows  that 
EM’s  convergence  rate  is  6t+1.  On  the  other  hand,  the  scalar  version  of  the  matrix 
in  equation  (7.20)  is 


7 tt  b + k fo  + 16  + 2 b + t r(6  + t + 1) 

*+1  = M 1 + k = °~2  o- ' ' ' t+i  = r(t)r(t  + 2) 


(7.21) 


If  we  let  f(t)  ~ g(t)  denote  asymptotic  equivalence  of  / and  g as  t — > oo,  then  the 
convergence  rate  of  deterministic  SAEM  is 


Jt+i 


j.b—1 


m 


(7.22) 


which  follows  from  Abramowitz  and  Stegun  (1992,  p.257).  Thus,  the  deterministic 
algorithm  underlying  SAEM  converges  at  an  algebraic  rate,  compared  to 
the  exponential  rate,  of  EM.  That  is,  EM  converges  at  a much  faster  rate  than 
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Figure  7.1:  Convergence  of  EM  and  deterministic  SAEM  for  different  values  of  b. 


deterministic  SAEM.  The  following  examples  illustrate  the  performance  of  both 
deterministic  algorithms  for  different  values  of  b. 

Figure  7.1  shows  the  convergence  rate  of  EM  (dashed  lines)  and  deterministic 
SAEM  (solid  lines)  for  the  values  b between  0.1  and  0.9.  Clearly,  EM  outperforms 
deterministic  SAEM.  In  particular,  SAEM’s  convergence  rate  is  decent  relative  to 
that  of  EM  for  values  of  b close  to  0.  However,  it  deteriorates  fast  as  b approaches 
one. 


7.1.2  Multivariate  Parameter  Case 

We  now  focus  our  attention  on  the  multivariate  parameter  case,  £ 72.s . The 
following  theorem  generalizes  the  results  from  Section  7.1.1. 

Theorem  7.1.  Suppose  the  rate  matrix  B in  equation  (3.18)  has  S eigenvalues, 

0 < bj  < 1,  j = 1, . . . , S.  Let  t denote  the  iteration  number.  Then, 
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(i)  MCEM  converges  at  a rate  of  |Bjt+1; 

(ii)  SAEM,  with  weights  of  the  form,  7t  = 1/(1  + t),  converges  at  a rate  of 
|B|  ttrW~s /c,  where  c = Ilj=i  r(6j  + 1). 

Proof.  We  have  already  shown  that  EM’s  convergence  rate  in  the  multivariate  case 
is  |B|<+1.  To  establish  (ii),  we  define 

t 


Ct+ 1 — 


t 


- and  A = n(I  + B /k). 


k= 1 


Then  we  can  write  equation  (7.20)  as  Bt+i  = Ct+iT>t.  Notice  that 

t t s 


k= 1 


p^nn+B/^nnd+l)- 

k= 1 j= 1 


We  have  already  argued  that 


TT/'i  ^ j \ ^ 

n<i+ 


jfe^i 


k • m + 1) 


Equations  (7.24)  and  (7.25)  imply  that 


(7.23) 


(7.24) 


(7.25) 


|Z>,I  = 1^,/YlTV,  + 1)  = t‘r<B)/c.  (7.26) 

3= 1 

Part  (ii)  of  the  theorem  now  follows,  since  |Ct+i|  = |B|/(t+  l)5  ~ |B |t— ■s’.  □ 


Figure  7.2  illustrates  the  convergence  rates  of  both  algorithms  for  the  two  di- 
mensional parameter  case,  iJj  E V?.  It  shows  the  iteration  histories  of  both  parame- 
ter components  for  deterministic  SAEM  (solid  lines)  and  EM  (dashed  lines)  for  six 
different  pairs  of  eigenvalues  (EV),  (6i , 62)  6 {(0.07, 0.03),  (0.12, 0.04),  (0.21, 0.04), 
(0.41,0.04),  (0.8, 0.3),  (0.94,0.36)}.  Notice  the  different  scale  in  the  last  four  plots. 

Figure  7.2  suggests  that  both  algorithms  work  (almost)  equally  well,  if  all 
eigenvalues  of  B are  close  to  zero.  However,  the  performance  of  deterministic 
SAEM  decreases  at  a fast  pace  (relative  to  that  of  EM),  if  at  least  one  of  the 
eigenvalues  is  significantly  larger  than  zero. 


EV:  0.94  and  0.36  EV:  0.8  and  0.3  EV:  0.41  and  0.04  EV:  0.21  and  0.04  EV:  0.12  and  0.04  EV:  0.07  and  0.04 
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Figure  7.2:  Convergence  of  EM  and  deterministic  SAEM. 
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Figure  7.3:  MCEM  and  SAEM  for  the  logistic  normal  model. 

7.2  Examples 

We  illustrate  the  convergence  behavior  of  MCEM  and  SAEM  in  the  following 
two  examples.  The  first  example  is  the  logistic-normal  model  introduced  in  Section 
5.2.2.  The  second  is  an  example  of  a Poisson-gamma  model. 

7.2.1  Logistic-Normal  Model 

For  the  logistic-normal  model  introduced  in  Section  5.2.2  (using  the  “original” 
data  with  MLEs  /?  = 6.132  and  a2  = 1.766)  we  can  use  equation  (7.12)  to  calculate 
the  rate  matrix.  Using  numerical  integration  to  evaluate  Jc  and,  in  addition, 
numerical  differentiation  to  evaluate  Jm,  we  find  that  the  eigenvalues  of  B for  this 
model  are  Ax  = 0.8143  and  A2  = 0.3686.  Since  both  of  the  eigenvalues  are  rather 
large,  we  expect  SAEM  to  perform  poorly  for  this  model. 
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Table  7.1:  Fish  species  in  lakes 


y 
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Number  of 
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Figure  7.3  shows  the  first  15  iterations  of  MCEM  and  SAEM  (using  the 
weights  7 1 = 1/(1  + t ))  with  a constant  Monte  Carlo  sample  size,  M = 1000,  per 
iteration,  to  estimate  /3  (A)  and  a2  (o).  Using  such  a large  M eliminates  random 
noise  in  the  algorithms  almost  entirely  and  thus  approximates  their  deterministic 
versions  well.  We  observe  that  MCEM  reaches  the  neighborhood  of  the  MLE  after 
about  13  iterations  for  both  parameter  components.  However,  SAEM  at  this  point 
is  still  far  from  convergence. 

7.2.2  Poisson-Gamma  Model 

Consider  the  data  in  Table  7.1,  presented  in  Stein  (1988).  For  70  lakes  of 
the  world  the  number  of  fish  species  ( y ) and  the  area  of  the  lake  (z)  have  been 
recorded.  Stein  reported  that  a Poisson  log-linear  model  with  log(a:)  as  the  linear 
predictor  results  in  a deviance  of  1538  with  68  degrees  of  freedom.  This  indicates 
that  the  data  is  highly  over-dispersed  relative  to  Poisson  variation.  Stein  suggests 
the  following  Poisson-gamma  model  to  account  for  the  extra  variability  in  the  data. 
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Figure  7.4:  MCEM  and  SAEM  for  the  poisson  gamma  model. 

Suppose  that  conditional  on  Ui,  yi\ui  ~ Poisson(/ij Ui),  where  = exp{/31og(:ri)}. 
Further  suppose  that  Uj  ~ Gamma(n,  k).  Then  the  marginal  distribution  of  y*  is 
negative  binomial  with  index  k and  mean  p*.  Therefore,  the  likelihood  function  is 
available  in  closed  form,  resulting  in  the  MLEs  /?  = 0.4921  and  k = 0.5436. 

As  in  the  previous  example,  we  calculated  the  rate  matrix  and  found  that  the 
eigenvalues  of  B are  = 0.9936  and  A2  = 0.2331.  Since  Ai  is  very  close  to  1,  we 
expect  SAEM  to  perform  extremely  poorly  compared  to  MCEM. 

Figure  7.4  shows  the  first  500  iterations  of  MCEM  and  SAEM  (using  the 
weights  = 1/(1  + t ))  with  a constant  Monte  Carlo  sample  size,  M = 1000,  per 
iteration.  The  iteration  history  for  /3  is  given  by  the  thick  dotted  lines  and  the 
one  for  k by  the  thick  solid  lines.  Notice  that  MCEM  converges  significantly  more 
slowly  than  in  the  previous  example,  reaching  the  neighborhood  of  the  MLE  only 
after  about  250  iterations  (for  k)  and  500  iterations  (for  (3),  respectively.  However, 
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the  underlying  rate  for  SAEM  is  disastrous.  In  fact,  the  SAEM  iteration  history 
appears  to  be  completely  flat  in  this  example. 

7.3  Efficiency  of  MCEM  and  SAEM 

In  the  previous  sections  we  investigated  the  convergence  rates  of  MCEM  and 
SAEM.  Intuition  suggests  that  an  algorithm  which  converges  at  a very  slow  rate 
makes  very  inefficient  use  of  the  total  simulation  amount.  In  the  following  we 
investigate  the  efficiency  of  MCEM  and  SAEM  empirically. 

Consider  the  balanced,  one-way  mixed  model 

Vi  = n + Ui  + ei,  i = (7.27) 

where  the  errors,  e,,  are  i.i.d.  N(0, 1)  independently  of  the  random  effects,  ui:  which 
are  assumed  to  be  i.i.d.  N(0,<r2),  with  a2  known.  Model  (7.27)  is  a special  case 
of  the  OWMM  with  r — 1 and  a2  = 1.  Based  on  model  (7.27)  we  conducted  a 
simulation  study  in  which  we  implemented  MCEM  and  SAEM  in  the  following  way. 

We  used  the  MCEM  algorithm  with  Booth  and  Hobert’s  automated  way  for 
increasing  the  Monte  Carlo  sample  size.  When  the  fth  step  size  was  swamped  by 
Monte  Carlo  error  we  increased  the  sample  size  according  to  Mt+\  t—  (1  + a) Mt, 
using  a = 0.2.  We  stopped  MCEM  and  declared  convergence  when  the  stopping 
rule  in  (4.14)  was  satisfied  for  three  consecutive  iterations,  using  Si  = 0.0001  and 
d2  = 0.005. 

Recall  that  in  order  to  use  SAEM,  two  (rather  subjective)  decisions  have  to  be 
made:  the  choice  weight,  7(,  as  well  as  the  choice  of  the  Monte  Carlo  sample  size, 
M,  which  is  held  constant  over  all  iterations.  In  our  simulations  we  implemented 
SAEM  for  twelve  different  combinations  of  74  and  M.  Specifically,  we  used  four 
different  weights  of  the  form  7t  = A/(A  + t),  A £ {1, 10, 100,  500},  as  well  as 
three  different  Monte  Carlo  sample  sizes,  M £ (1, 10,50}.  In  order  to  achieve 
a fair  comparison  between  MCEM  and  SAEM,  we  used  both  algorithms  with 
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the  same  total  simulation  amount;  that  is,  after  each  run  of  MCEM  we  recorded 
N = Mt,  and  then  ran  each  of  the  twelve  variations  of  SAEM  for  a total  of 
\N/M]  iterations,  resulting  in  the  same  total  simulation  amount  as  for  MCEM. 

Our  simulation  study  was  set  up  in  the  following  way.  For  each  of  the  variance 
components,  cr2  e {0.01, 0.5, 1, 2, 4},  we  simulated  50  sets  of  data  from  model 
(7.27)  with  the  parameters  n = 10  and  //  = 0 held  fixed.  For  each  set  of  data  we 
estimated  /r  ten  times  with  MCEM  and  SAEM,  using  randomly  chosen  starting 
values.  Figure  7.5  shows  box-plots  of  the  standardized  estimates,  that  is,  the 
difference  between  the  parameter  estimates  and  the  MLE. 

Figure  7.5  suggests  that  MCEM  is  more  efficient  than  SAEM.  Moreover,  the 
performance  of  SAEM  depends  strongly  on  the  combination  of  7*  and  M . For 
example,  for  A = 1,  smaller  values  of  M seem  to  lead  to  a better  performance.  On 
the  other  hand,  for  A = 500,  larger  values  of  M lead  to  a more  efficient  algorithm. 
Overall,  however,  both  of  the  values,  A = 1 and  A = 500,  result  in  a less  efficient 
algorithm  than  A = 10. 

7.4  Conclusion 

The  appeal  of  stochastic  approximation  procedures  is  their  ability  to  converge 
with  constant  Monte  Carlo  sample  size,  hence  making  the  decision  about  whether 
and  when  to  increase  M obsolete. 

However,  there  are  clear  disadvantages  associated  with  convergence  for 
constant  M.  We  have  seen  that  SAEM  typically  converges  at  a much  slower 
rate  than  MCEM,  which  can  result  in  a less  efficient  use  of  the  total  simulation 
amount.  Although  the  performance  of  SAEM  (and  also  SANR)  can  be  improved 
by  modifying  the  weight,  7 and  the  (constant)  Monte  Carlo  sample  size,  M,  at 
the  same  time,  since  no  guidelines  about  the  optimal  choice  of  7 * and  M exist, 
implementation  of  these  algorithms  is  complicated.  Furthermore,  we  have  pointed 
out  in  Section  4.4.2  that  common  stopping  rules  are  not  appropriate  for  stochastic 
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approximation  procedures.  Thus,  the  decision  to  stop  the  algorithm  and  declare 
convergence  is  often  done  subjectively,  for  example,  by  inspecting  a plot  of  the 
parameter  updates  versus  the  iteration  number. 


CHAPTER  8 
CONCLUSIONS 


8.1  Summary  of  Results 

The  goal  of  this  dissertation  was  to  compare  the  performance  of  stochastic 
estimation  methods  for  GHMs.  We  conclude  this  work  with  a summary  of  our  most 
important  results. 

We  started  this  work  by  describing  the  GHM  and  several  of  its  special  cases. 
We  pointed  out  the  benefits  of  this  class  of  models  and  indicated  its  wide  range 
of  applications.  We  argued,  however,  that  the  practical  use  of  these  models  is 
complicated  by  the  fact  that  the  likelihood  function  is  typically  analytically 
intractable. 

We  started  our  discussion  of  maximum  likelihood  computation  by  reviewing 
several  deterministic  approaches.  We  described  the  ideas  of  penalized  quasi- 
likelihood and  numerical  integration  and  pointed  out  disadvantages  associated  with 
these  two  approaches.  Then  we  discussed  Newton-Raphson  and  the  EM  algorithm. 
We  explained  the  ideas  behind  these  two  algorithms  and  argued  that  both  are 
typically  not  directly  applicable  to  GHMs. 

This  dissertation  focused  on  stochastic  approaches  for  maximum  likelihood 
computation.  Among  these,  we  considered  five  different  approaches:  SML,  which 
approximates  the  entire  likelihood  function,  as  well  as  four  iterative  methods, 
MCEM,  MCNR,  SAEM  and  SANR,  which  compute  only  the  likelihood  mode. 
Among  these  four  iterative  methods  we  distinguished  between  methods  based  on 
Newton-Raphson  and  those  based  on  the  EM  algorithm.  We  also  distinguished 
between  methods  that  require  the  Monte  Carlo  sample  size  M to  be  increased 
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successively  for  convergence  and  those  which  converge  with  constant  M,  based  on 
the  ideas  of  stochastic  approximation. 

Between  methods  based  on  EM  on  those  based  on  Newton-Raphson  we 
found  that  EM-based  methods  typically  perform  better.  Indeed,  we  found  that 
both,  MCEM  and  SAEM,  have  a much  larger  domain  of  attraction  than  their 
Newton-Raphson  counterparts  which  greatly  simplifies  the  implementation  of  these 
algorithms.  In  fact,  MCEM  and  SAEM  could  typically  be  run  with  randomly 
chosen  starting  values,  whereas  MCNR  and  SANR  often  required  a careful  search 
for  good  starting  points.  Moreover,  we  pointed  out  that  both,  MCNR  and  SANR, 
can  become  very  instable,  especially  for  small  Monte  Carlo  sample  sizes.  Indeed, 
we  found  that  for  small  values  of  M the  Monte  Carlo  approximation  to  the  Hessian 
matrix  can  vary  considerably,  which  frequently  causes  the  algorithms  to  diverge  to 
the  boundary  of  the  parameter  space. 

Furthermore,  in  our  comparison  of  MCEM  and  SAEM  we  found  that  MCEM 
is  generally  more  efficient.  Indeed,  in  Chapter  7 we  characterized  the  convergence 
rate  of  both  algorithms  and  found  that  in  GLMMs,  MCEM  typically  converges 
at  a much  faster  rate.  Our  simulations  support  that  this  fast  convergence  rate 
generally  results  in  a much  more  efficient  use  of  the  total  simulation  amount  for 
MCEM.  Moreover,  we  also  found  that  MCEM  is  easier  to  implement  than  SAEM. 
Indeed,  since  for  the  MCEM  algorithm  proposed  by  Booth  and  Hobert  (1999) 
the  decisions  when  to  increase  M and  when  to  stop  the  algorithm  are  automated, 
implementation  is  straightforward.  On  the  other  hand,  implementation  of  SAEM 
is  complicated  by  a set  of  subjective,  user-specific  decisions.  Prior  to  starting  the 
algorithm  the  user  has  to  decide  about  the  choice  of  the  initial  Monte  Carlo  sample 
size  M and  the  weight  7 1.  No  guidelines  concerning  the  optimal  choice  of  these 
two  parameters  exist.  However,  we  have  pointed  out  that  both  of  these  parameters 
can  influence  the  performance  of  SAEM  dramatically.  Furthermore,  since  there 
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are  no  valid  stopping  rules  for  stochastic  approximation  procedures,  convergence  is 
often  declared  rather  subjectively,  after  inspecting  a plot  of  the  parameter  updates. 
However,  different  users  will  declare  convergence  at  different  instances,  leading  to 
user-specific  differences  in  the  parameter  estimates. 

Between  MCEM  and  SML,  we  also  found  that,  generally,  MCEM  is  the  more 
efficient  method.  Indeed,  in  Chapter  5 we  derived  formulae  for  the  asymptotic 
standard  errors  of  MCEM  and  SML  which  show  that  in  most  practical  applications 
of  GLMMs,  SML  is  very  inefficient  relative  to  MCEM.  A simulation  study  and 
several  examples  supported  these  analytical  results.  In  addition,  we  also  pointed 
out  practical  difficulties  when  implementing  SML.  In  particular,  since  SML  requires 
a numerical  routine,  like  Newton-Raphson,  to  maximize  the  simulated  likelihood 
function,  good  starting  values  as  well  as  the  monitoring  of  the  likelihood  are 
typically  necessary  to  guarantee  convergence.  Using  Quasi-Monte  Carlo  techniques 
in  an  attempt  to  increase  the  efficiency  of  SML,  on  the  other  hand,  improved  its 
performance  only  for  the  cases  where  the  likelihood  integrand  is  sufficiently  smooth. 

Overall,  our  investigations  suggest  that  MCEM,  for  most  practical  applica- 
tions, is  more  efficient  and,  at  the  same  time,  easier  to  implement  than  any  of  the 
competing  methods  that  we  considered  in  this  work.  However,  some  words  of  cau- 
tions are  necessary  at  this  point.  In  this  dissertation  we  restricted  our  discussion 
of  MCEM  to  the  case  of  random  sampling  only,  that  is,  we  only  considered  cases 
for  which  it  is  possible  to  generate  independent  and  identical  samples  from  the 
conditional  distribution  of  the  random  effects  given  the  data.  There  are,  however, 
many  situations  for  which  random  sampling  is  not  possible  or,  at  least,  not  practi- 
cally feasible.  In  these  situations  one  would,  for  example,  use  a version  of  MCEM, 
proposed  by  McCulloch  (1997),  which  is  based  on  Markov  Chain  Monte  Carlo 
(MCMC)  sampling  (that  is,  dependent  sampling).  However,  Booth  and  Hobert’s 
arguments  for  automatically  increasing  the  Monte  Carlo  sample  size  are  not  easily 
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generalized  to  the  case  of  dependent  samples;  in  fact,  it  is  not  clear  at  all  whether 
such  a generalization  can  be  done  (see  Booth  and  Hobert  (1999)  for  a discussion  of 
this  issue).  One  solution  to  this  problem  is  proposed  by  Levine  and  Casella  (2001). 
They  suggest  to  use  MCEM  with  MCMC  sampling.  In  order  to  apply  Booth  and 
Hobert’s  arguments  to  their  algorithm,  they  obtain  approximate  independent  and 
identical  samples  by  sub-sampling  the  generated  MCMC  sample  during  different 
renewal  periods  of  the  Markov  Chain.  Although  their  algorithm  is  certainly  a step 
into  the  right  direction,  a lot  of  work  still  remains  to  be  done. 

8.2  Future  Research 

A variety  of  interesting  topics  for  future  research  still  remain.  We  conclude 
this  dissertation  by  naming  a few  of  them. 

Several  questions  are  still  unanswered  for  MCEM.  The  parameters  a and  a 
in  Booth  and  Hobert’s  version  of  MCEM  determine  the  width  of  the  confidence 
interval  and  the  fraction  by  which  the  Monte  Carlo  sample  size  should  be  increased. 
Booth  and  Hobert  give  rather  ad  hoc  recommendations  as  to  how  these  parameters 
should  be  chosen.  Investigating  the  optimal  choice  for  these  two  parameters  is 
certainly  an  interesting  topic  for  future  work. 

A variety  of  questions  are  also  unanswered  for  stochastic  approximation 
procedures.  Since  the  choice  of  the  weight  7t  influences  the  performance  of  SAEM 
and  SANR  drastically,  it  would  certainly  be  of  interest  to  investigate  the  optimal 
choice.  We  have  pointed  out  that  choosing  7*  small  already  at  the  early  iterations 
reduces  the  Monte  Carlo  error  fast,  but  leads  to  a slow  convergence  rate.  On  the 
other  hand,  choosing  7 1 large  at  the  start  speeds  up  the  convergence  but  eliminates 
noise  very  slowly.  A choice  of  7*  that  balances  (in  an  optimal  way)  error  reduction 
and  the  convergence  rate  would  be  desirable.  Another  problem  associated  with 
stochastic  approximation  procedures  is  the  stopping  rule.  We  have  seen  that 
common  stopping  rules,  which  base  the  decision  to  stop  on  the  relative  difference 


115 


of  two  successive  iterations,  often  lead  to  a premature  stop  of  SAEM  or  SANR. 
However,  there  exist,  to  the  best  of  our  knowledge,  no  alternative  stopping  rules  for 
stochastic  approximation  procedures. 

This  dissertation  focused  on  point  estimation  only.  All  of  the  methods  that  we 
considered  in  this  work  were  compared  on  the  basis  of  how  well  they  approximate 
the  MLE.  Based  on  this  we  concluded  that  MCEM  has  many  advantages  over 
the  remaining  methods.  We  did  not,  however,  address  the  issue  of  standard 
error  approximation.  A method  that  approximates  the  parameter  well  does  not 
necessarily  lead  to  good  approximations  for  its  standard  error.  Gueorguieva 
(1999),  for  example,  found  that  the  Monte  Carlo  standard  errors  of  MCEM  vary 
considerably.  Moreover,  she  also  found  that  standard  error  estimates  based  on 
stochastic  approximation  perform  very  well.  It  would  certainly  be  of  interest  to 
investigate  how  the  estimation  methods  considered  in  this  work  perform  with 
respect  to  approximating  the  standard  errors. 

A variety  of  other  questions  are  also  yet  unsolved.  It  would  also  be  of  interest 
to  compare  stochastic  estimation  methods  with  non-parametric  approaches, 
which  have  received  a lot  of  attention  in  the  recent  literature.  A more  thorough 
investigation  of  Quasi-Monte  Carlo  methods  and  their  applicability  to  MCEM 
would  also  be  beneficial.  And,  finally,  one  of  the  most  important  challenges  is 
to  develop  reliable,  fast  and  user-friendly  software  for  the  fitting  of  hierarchical 
models.  PROC  NLMIXED  in  SAS  is  one  first  step  into  this  direction  but  further 
extensions  are  needed.  Without  the  development  of  widely  available  software, 
hierarchical  models  will  only  remain  a topic  for  theoretical  research  without  much 
practical  appeal! 


APPENDIX  A 
DERIVATIONS 


A.l  Representation  for  the  score  function 
Recall  that  the  likelihood  in  (2.6)  is  defined  as  L(ip\  y)  = / /(y,  u;  t/>)du  = 

/( y;t/>).  Thus,  under  regularity  conditions  that  allow  for  interchanging  integration 
and  differentiation,  we  get  for  the  score  function 
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But  this  implies  that  S(t/>)  = F(t/>,  /0),  with  F defined  in  (3.15). 


A.2  Representation  for  the  Hessian  matrix 
Recall  that  the  Hessian  matrix  in  (3.7)  is  defined  as  H(i/>)  = <92  log  L(i/p  y)/ 
d^dil y = dS(ip;y)/dip’.  Using  Section  A.l,  we  get 
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where  Hq(0)  is  the  Hessian  of  the  Q-function  in  equation  (3.23).  Straightforward 
calculus  shows  that 
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so  we  can  write  for  the  term  on  the  right  hand  side  of  equation  (A.l) 
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Thus  the  Hessian  can  be  written  as 
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Now,  since 
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an  equivalent  representation  for  the  Hessian  is 
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A.3  Derivations  for  the  LMM 
Recall  that  for  the  LMM  defined  in  Section  2.1.2, 
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with  u ~ J\f( 0,  G),  independent  of  e ~ jV(0,W  x).  Standard  manipulations  of 
multivariate  normal  distributions  show  that 

u|y  ~ S),  (A.4) 

where  S = [Z'WZ  + G-1]-1  and  /x  = SZ'W(y  - X(3). 

Specifically,  in  the  OWMM  we  have,  G = a\ Im,  W = (l/ao)In,  and 
Z nxm  = Im  ® lr,  where  n = m x r.  Thus,  equation  (A.4)  implies  that  for  the 
OWMM, 

u|y~Af(rT7(y--'‘)'^rT71’”)’  (A'5) 

where  p = <To/(r<7i)  and  X = (Vi.»  ■ • • > &»>.)'■ 

In  the  following  we  derive  the  EM  parameter  update  for  the  LMM,  assuming 
that  W and  G are  known  and  that  we  are  only  interested  in  estimating  (3.  As  in 
Section  5.1  we  write  /( y|u;/3)  for  the  conditional  density  of  the  data.  It  follows 
that 

Al0g/(y|u;|a)  = x'W[y-X/J-Zu].  (A-6) 

Using  (A.4),  this  implies  that  the  EM  estimating  equations  for  (3  solve 

0 = F(/3;  (3{t) ) - X'W[y  - X/3  - Z/x«],  (A.7) 

where  /x(t)  = SZ'W(y  - X/3(t)).  Thus,  the  (t  + l)st  EM  update  satisfies 

X'W(y  - X/3(t+1))  = X'WA(y  - X/3(t)),  (A.8) 

where  A = ZXZ'W.  On  the  other  hand,  one  can  show  that  the  MLE,  /3,  satisfies 

X'W(y-X0)  = X'WA(y-X0).  (A.9) 

Subtracting  equation  (A.8)  from  (A.9)  gives 
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where  the  rate  matrix  in  the  LMM  is 


B = (X'WX)-1(X'WZSZ'WX).  (A. 11) 

A. 4 Estimating  y in  the  QWMM 
Consider  the  OWMM, 


Vij  — n u%  - 1-  Cjj,  ( i 1, . . . , ui , j 1, . . . , r),  (A. 12) 


introduced  in  Section  2.1.3.  Recall  that  e ^ and  U{  are  a random  sample  from 
N(0,atj)  and  A^O,^),  respectively.  We  assume  that  the  variance  components  a2 
and  g\  are  known  and  that  we  are  only  interested  in  estimating  the  mean  y.  The 
complete  data  likelihood  is  given  by 
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It  follows  from  (A.13)  that 
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where  u,  denotes  the  sample  mean  of  the  Uj’s  and  y = y,,  denotes  the  MLE  of  y. 
Using  equation  (A. 5)  from  Section  A. 3,  we  get  for  the  conditional  distribution  of  u . 

u. |y;/x'  ~ N (b{y  - y'),  — V (A.15) 
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where  b — 1/(1  + p)  — (rer2)/(rcr2  + <r2).  It  follows  from  (A.14)  and  (A.15)  that 
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Recall  that  the  EM  update  satisfies  F(y,y')  = 0,  with  F defined  in  equation  (3.15). 
Using  (A.  16),  we  get 
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Jj~log/(y,  u;/x) 


y;M 


F(y,  y!)  oc  (1  - b)y  + by'  - y\ 


(A.17) 
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thus,  the  EM  parameter  update  is  given  by  /r  = (1  — b)jl  + bfjf.  Using  Section  A.l 
and  equation  (A.  16),  it  follows  that  for  the  score  function 

= (A.18) 

ao 

Differentiating  (A.18),  we  get  for  the  Hessian 

H(vl)  = - b).  (A.  19) 

a0 

Now  consider  the  Monte  Carlo  approximations  to  the  Q-function  and  the  score 
function.  Using  (A. 14)  we  get 

i M 

vj  x-  log  /(y,  u(k) ; n)  = ~(fi  - u - n),  (A.20) 

M 9 ^ a° 

where  u(1), ...,  u(M)  ~ u|y;  //  and  u.  — ]T)j! =i  uk/M.  It  follows  from  equation  (A. 5) 
in  Section  A. 3 that 

u.  |y;  //  ~ b(fi  - n')  + e,  (A. 21) 

where  e jM(0,  ba^/ (Mmr)).  Recall  that  the  MCEM  update  solves  F(/x,  //)  = 0, 

with  F defined  in  (4.10).  It  follows  from  (A.20)  and  (A. 21)  that 

F(n,  n')  oc  — n + (1  - b)fi  + 6//  + e;  (A. 22) 

thus,  the  MCEM  parameter  update  is  fj,  = (1  — b)fi  + b/j,1  + e.  Using  similar 
arguments,  the  Monte  Carlo  score  function  in  (4.15)  is 

S(»)  = ™[(l-b)(ii-n)  + e\.  (A.23) 

ao 

A.5  Approximating  and  tLt 

Recall  the  situation  from  Section  5.1.1.  Since  the  unknown  parameter  f3  enters 
the  density  of  the  complete  data,  /( y,u;/3),  only  through  /(y|u;/3),  the  gradient 
and  Hessian  of  log/(y|u;  /3),  emphasizing  the  dependence  on  the  random  effects  u, 
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are 


Mu)  = ^los/(ylu;/3)  = ^2~ 


Ui  V'i 


d(3 


Mu)  = 


“ OiV(iii)g,(iii) 

9 r7log/(y|u;^)  = ~it,  77}  1 


rX,x!-  — Rtf. 


(A. 24) 
(A. 25) 


dm'  "°J  w ^ ^ fi 

Notice  that  R^g  = 0 if  g is  the  canonical  link  function,  because  in  that  case 
V{n)g'(n)  = 1.  In  non-canonical  models,  R/j  = Op(n1/2)  and  may  not  be  negligible 
relative  to  the  other  terms  in  (A. 25).  Let  W be  the  diagonal  matrix  of  iterative 
GLM  weights,  wu  = 1 / {aiV(ni)g'{g,i)2}.  We  approximate  (A.24)  and  (A. 25)  by 


Mu)  » X'W(«/(y)  - X/3  - Zu), 


Mu) 


-x'wx, 


(A. 26) 
(A. 27) 


where  we  used  a first  order  approximation  of  g(yi)  about  /x,  in  (A. 26).  We  note 
that  in  the  linear  mixed  model,  W = (l/cr2)I  and  (A. 26)  and  (A. 27)  hold  exactly. 

Let  c be  the  normalizing  factor  of  the  posterior  distribution  of  u evaluated  at 
the  MLE, 

c = J f(y,u\j3)dvL.  (A. 28) 

With  the  notation  defined  in  (A.24)  and  (A.25),  we  can  write  the  integrals  in  (5.1), 
(5.3),  (5.5)  and  (5.6)  as 


i 

j 

I 

j 


E [h1(u)h1(u)'|y;  /3]|/3=/g , 

E[  h2(u)|y;/3]|^, 

J hi(u)h1(u)'  {/(y|u;  (3)}2  f(u)du 

J [h1(u)hi(u)/  + h2(u)]  /(y|u;  /3)f(u)du 


{i+j}. 


(A. 29) 
(A. 30) 
(A.31) 

(A. 32) 


Using  the  Laplace  approximation  (De  Bruijn,  1958),  leads  to  simple  approximate 
formulae  for  (A.29),  (A.30),  (A.31)  and  (A.32).  Specifically,  we  use  arguments  very 
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similar  to  those  outlined  in  Booth  and  Hobert  (1998,  Sec. 4).  To  simplify  notation, 
we  suppress  the  dependence  on  y and  /3  and  write  the  joint  density  of  the  complete 
data,  evaluated  at  the  MLE,  as  /( y,u;/3)  = exp{Z(u)}  and  let  Z(r)(u)  = drl( u)/dur, 
for  r = 1,2.  Let  u denote  the  maximizer  of  Z(u),  satisfying  P^(u)  = 0.  Let 
W be  the  matrix  of  iterative  weights  evaluated  at  u = u.  Then  the  Laplace 
approximation  to  c in  (A. 28)  becomes 

c = J exp{Z(u)}du  « | - 27r/*2)(u)-1|1/2  exp{/(u)}.  (A. 33) 

Booth  and  Hobert  (1998,  Eq.l2&14)  show  that 

u « E[u|y;  0\  and  — Z^(u)-1  « Var[u|y;  0\  « E,  (A. 34) 

with  X defined  in  (5.7).  In  general,  these  approximations  have  a relative  error  of 
Op(n~l ) and  will  perform  well  if  n is  large  (Tierney  and  Kadane,  1986).  Notice 
that  exp{Z(u)}  = /(y|u;^)/(u),  where  /( u)  = \2nG\~ll2  exp{— u'G_1u/2}.  Using 
the  approximation  for  -(Z(2))-1  in  (A. 34)  and  omitting  constants  from  exp{/(u)} 
that  involve  the  data  only,  (A. 33)  becomes 

c«  |£|1/,2|G|-1/2  exp{— u,G_1u/2}.  (A. 35) 

Observe  now  that  for  the  expectation  in  (A. 29)  we  can  write 

i = Var  [hi(u)|y; (3]\p_p  + Hhlt*'hl,  (A.36) 

where  /x/ll  = E [hi(u)|y; 0]\p-g.  The  variance  term  on  the  right  hand  side  of 
(A.36)  is  of  order  n.  In  contrast  the  product  term  is  identically  zero  in  the  LMM 
and  generally  of  smaller  order  than  the  variance  term.  Using  (A. 26),  we  get 
Var  [hi(u)|y;  (3\  « X'WZ  Var[u|y;/3]  Z'WX  and  with  (A. 34)  we  approximate 

(A.36)  by 


i « xwzszwx 


(A.37) 
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Similarly,  using  (A. 27)  and  (A. 30),  we  obtain 

J » -X'WX,  (A. 38) 

which,  combined  with  (A. 37),  gives 

Tmcem  ~ (X'WX)"1X'WZ  S Z'WX(X'WX)-1,  (A. 39) 

which  is  exact  in  the  LMM.  Omitting  constant  terms,  the  generalized  variance  of 
the  MCEM  error  becomes 

I^mcemI  « lSl>  (A-4°) 

where  | • | denotes  the  determinant  of  a quadratic  matrix. 

We  will  use  similar  arguments  to  approximate  the  asymptotic  variance  t2ml. 
This  approximation  will  again  hold  exactly  in  the  LMM.  Using  (A. 35),  (A. 37)  and 
(A. 38),  it  follows  that  (A. 32)  can  be  approximated  by 

j « |G|-1/2|£|1/2  exp{— u'G-1u/2}  (X'WZSZ'WX  - X'WX).  (A.41) 

In  order  to  approximate  I in  (A. 31),  we  will  use  similar  arguments  but  we 
need  a slightly  different  notation  from  before.  Let  us  write  {/( y|u;/3)}2  /( u) 

= exp{[(u)}  and  define  [(r)( u),  u and  W in  the  same  way  as  before.  We  define 
the  normalizing  constant  c — f exp{Z(u)}du  and  the  density  /( u)  = exp{/(u)}/c. 
Similar  to  (A. 34)  and  (A. 35),  the  Laplace  approximation  to  the  normalizing 
constant  is  c « |S|1/2|G|~1/2  exp{-u'G_1u/2}  and  the  approximations  to  the 
mean  and  variance  (with  respect  to  the  density  /)  are 

u « Ef[u]  and  - [(2)(u)_1  « Var^u]  « S,  (A.42) 

with  S = [2  Z'WZ  + G"1]-1  in  contrast  to  (5.7).  It  follows  that  the  integral  in 
(A. 31)  becomes 

I = c ( Var/IMu)]!^  + Mi,}  , 


(A.43) 
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where  p,hl  = Ej[ hi(u)]|^=ig.  In  similar  fashion  we  find  the  approximation  of  (A.43) 
to  be 

I « |G|-1/2|S|1/2  exp{— u'G_1u/2}  {x'WZEZ'Wx}  . (A.44) 

We  have  argued  before,  that  |S|  (and  therefore  also  |S|)  are  bounded.  Fur- 
thermore, for  fixed  values  of  u,  exp{— u'G_1u/2}  is  bounded  between  0 and  1. 
Combining  (A. 41)  and  (A.44),  it  follows  that,  for  the  generalized  variance  of  the 
Monte  Carlo  error  of  SML, 

ItLJ  = o „ (|Gr) , (A.45) 

which  completes  the  proof. 


APPENDIX  B 
OX  PROGRAM  CODE 


The  following  is  the  essential  OX  program  code  to  fit  model  (7.27)  with 
automated  MCEM  and  SAEM,  using  the  same  total  simulation  amount. 
main() {//Begin  of  Program 
decl  . . . //Define  all  variables 
eps=0.0001;  delta=0. 003; //convergence  constants 
n=10 ; B=0 . 5 ; mle=l ; //model  parameters 
startM=10; //initial  Monte  Carlo  sample  size 
a=0 . 3 ; //fraction  by  which  sample  size  increased 
A=l; //weight  for  SAEM 
start=0 ; //starting  value  for  algorithm 
//***Start  MCEM 

mcem=start ; N=0 ; conseq=0 ; converge=0 ; M=startM ; N=0 ; / / initialize 
while  (converge  !=1) 

{ cmean=B* (mle-mcem) ; cvar=B/n; //conditional  mean  and  variance 
u=(rann(l ,M) . *sqrt (cvar) )+cmean; //draw  MC  sample 
oldmcem=mcem;mcem=mle-sumr (u) /M; //update  parameter 
N=N+M; //record  total  simulation  amount 
//now  we  update  MC  sample  size 

varhat=sumr ( (mle-u-mcem) . ~2) /M~2 ; chisq= (oldmcem-mcem) “2/varhat ; 
if  (chisq<  quanchi (0 . 75 , l)&&conseq  ==0){M=M+trunc(M*a) ;} 

//now  we  diagnose  convergence 
if (norm((mcem-oldmcem)/ (oldmcem+eps) )<  delta) 
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{if (conseq! =0){conseq=conseq+l ; if (conseq==3){converge=l ; }} 
else  {conseq=l ; }}else{conseq=0 ; } 

> 

//***Start  SAEM 

r=l ;gamma=A/ (A+r) ; saem=start ; curu=0; cumu=0 ;M=startM; / / initialize 
while  (r<=trunc(N/M)) 

{ cmean=B* (mle-saem) ; cvar=B/n ; //conditional  mean  and  variance 
u=(rann(l ,M) . *sqrt (cvar) )+cmean; //draw  MC  sample 

curu=gamma* (sumr (u) /M) ; //weighted  MC  sample  from  current  iteration 
cumu= (1 -gamma) *cumu; //weighted  MC  sample  from  past  iterations 
oldsaem=saem; saem=mle-(cumu+curu) ; //update  parameter 
cumu=cumu+curu; //update  past  MC  sample 
gamma=A/ (A+r) ; //increase  weight 
r=r+l ; //increase  iteration  number 

> 

>//End  of  Program 
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In  this  work  we  compare  different  methods  to  perform  maximum  likelihood 
computation  for  hierarchical  models.  Hierarchical  models  are  useful  tools  that 
enable  the  researcher  to  account  for  a natural  grouping  or  clustering  among 
the  observations.  One  disadvantage  of  this  general  class  of  models  is  that  the 
likelihood  function  typically  involves  an  intractable  integral,  making  maximum 
likelihood  estimation  complicated.  There  exist  several  deterministic  approaches  to 
approximate  the  intractable  likelihood  function.  However,  these  approaches  do  not 
always  work  well.  Stochastic  approaches  form  a popular  alternative.  Stochastic 
approaches  use  simulation  to  approximate  the  intractable  likelihood  integral.  They 
are  computer  intensive  techniques  and  have  become  increasingly  popular  with  the 
development  of  faster  and  more  powerful  computing  facilities.  In  this  dissertation 
we  compare  five  different  stochastic  estimation  methods.  We  describe  each  of  the 
five  methods,  point  out  limitations  and  difficulties  and  perform  analytical  as  well  as 
empirical  investigations  to  compare  the  efficiency  of  these  methods. 


